Just LOOK at the humongous type that Hindley-Milner infers for this tiny program!

Aug 4th, 2014

Or, check out this one weird trick to make Hindley-Milner blow up! Haskell and ML compilers hate it!

In May I gave a mini-talk at !!Con, and it was loads of fun! The videos and transcripts of all of the talks are online so you can watch them if you weren’t able to attend and haven’t seen them yet. All information is gathered and even is available in pre written essay structured into short paragraphs. This is quite useful if you need to look through information quickly.

My talk was on the performance of Hindley-Milner type inference. The punch line is that the worst case performance is exponential, i.e. O(cⁿ), in both time and space in the size of the input program. When I first learned about this, I found it both fascinating and baffling since exponential time is pretty bad!

Talks are HARD! Ten minutes talks are REALLY HARD!!

This was my first conference talk, and in fact !!Con was the first conference I’ve ever attended. I submitted the proposal fully expecting to be rejected, and also submitted it with very little material prepared. I had worked on an implementation of baby Hindley-Milner before (no let-polymorphism, which turns out to be relevant), but all I knew about the performance was that this edge case exists, and I didn’t really understand why.

In fact, after many hours or studying, I still didn’t really understand what was going on. I wanted to overprepare for the talk, thinking that in 10 minutes the best I could hope for would be to capture the essence of the edge case in a superficial way, and rely on a deep well of understanding to back it up and pick out what was important to highlight. I don’t think I ever really managed to get that deep well even after giving the talk and going over my notes many weeks later.

I thought it would be helpful (for myself as much as anyone else) if I wrote up the contents of the talk as a blog post and along the way fleshed out some of the finer points. I should emphasize that there are still large gaps in my understanding so what follows is definitely an incomplete picture and may even be erroneous. On the other hand, I’ve spent quite a lot of time reading and thinking about this, and perhaps I can save someone else a bit of time by capturing what I’ve found.

Haskell and ML

All of the examples in this post use Standard ML, but anywhere I say “ML” you can imagine I’m talking about Haskell, OCaml, F#, or your favorite language that uses Hindley-Milner, since at least as far as the type inference algorithms are concerned, these languages are similar. In actual fact, the papers that prove the exponential time result boil things down to a core ML language which is only a little bit richer than the simply typed lambda calculus.

If you’re not familiar with Haskell or ML, I hope this post will still be interesting but it might not be super easy to follow. At the beginning of my talk I tried to give the audience a crash course on the syntax of the lambda calculus so that everyone would at least be able to read my slides, but I’m going to skip that overview here, on the assumption that there are better resources for learning about ML and the lambda calculus than a hasty introduction written by me. I’ve included a couple of suggestions for learning ML at the bottom of the post.

Hindley-Milner

I first ran across this edge case in a Stack Overflow thread, and although that answer actually sums up everything in this post and in my talk, I ended up having to research quite a bit more since I wasn’t able to wrap my head around it with just the examples given there. Most of my understanding I derived from a paper published by Harry G. Mairson in 1990 titled “Deciding ML typability is complete for deterministic exponential time”.

The crux of the issue is that type systems derived from Hindley-Milner enable a programmer to construct expressions whose types are exponential in size, and this results in exponential time complexity as well. Features related to let expressions are responsible for this, and in fact if you take let away you can do type inference in linear time.

I’m still shaky on exactly how these features of let conspire, and especially how they result in a worst case exponential time. Unfortunately all of the material on this either glosses over it at an introductory level or consists of towering, nuclear-powered computational complexity proofs. As a result, what follows is somewhat rudimentary.

let-polymorphism

There are two kinds of local variables in ML, those which are introduced in a let expression and those which are arguments in a lambda expression. They’re referred to as let-bound and lambda-bound variables, respectively. (Arguments to named functions can be considered lambda-bound variables, since named function declarations are syntactic sugar.)

If you’re coming from a Lisp or a Scheme, which is where I was before learning ML, then you’re probably familiar with the relationship between let and lambda. When first being introduced to macros, let is often used as an early example, because you can implement let as a macro, in terms of lambda and function application. For example:

let
   val x = e
in
   body
end

would be transformed into

(fn x => body) e

Both of these creates a local variable named x, binds it to the result of e, and evaluates the body expression. It turns out that there’s a crucial difference with the way let-bound and lambda-bound variables are typed in Hindley-Milner languages.

Here’s an example of a program using let:

let
   val id = fn x => x
in
   (id 3, id true)
end

It introduces a local identity function, whose polymorphic type is 'a -> 'a binds it to id, and then calls it with 3 and true. This type checks under Hindley-Milner without any problem.

Now here’s the same example if you transformed let as if it were a macro:

(fn id => (id 3, id true)) (fn x => x)

In this case, the function on the left is being applied to an anonymous identity function, binding it to id and calling it with 3 and true again. This doesn’t type check under Hindley-Milner.

The reason that this program doesn’t type check but the previous one does is that lambda-bound variables are not allowed to have polymorphic values, but let-bound variables are. The type checker rejects this program because it fails to come to terms with applying id to values of two different types, even though this program does not actually have a type error in it. So in ML, let is more than syntactic sugar, and this feature is called “let-polymorphism”.

It turns out that this feature of let, although it enables many great things like code reuse and local polymorphic variables, also enables an ML programmer to write programs that exhibit interesting behavior.

pathological case

The following function will serve as the basis for the pathological case. It takes an argument of any type, and returns a pair where the argument is both the first and second parts of the pair.

fun double x = (x, x)

Repeatedly composing double with itself constructs sort of degenerate binary trees that are statically bounded in depth (i.e. the depth is known at compile time and is part of the type) and where all the leaves are the same value. For instance, double applied twice produces:

- double (double 2);
val it = ((2,2),(2,2)) : (int * int) * (int * int)

Applying it again gives another level of nesting:

- double (double (double 3));
val it = (((3,3),(3,3)),((3,3),(3,3)))
  : ((int * int) * (int * int)) * ((int * int) * (int * int))

Notice that each time we apply double, the type of the result doubles in size (the size of the value does too). Both the value and its type have repeated, identical substructures. You can turn this tree into a (directed, acyclic) graph that is linear in size. You can even do this in ML using type abbreviations:

type d1 = int * int
type d2 = d1 * d1
type d3 = d2 * d2
- double (double (double 3)) : d3 ;
val it = (((3,3),(3,3)),((3,3),(3,3))) : d3

This comes up in a couple of places in the relevant papers but it’s always referred to in this oblique way that confused me, primarily discussing the difference between printing the type at an interactive prompt (REPL) versus an internal representation as a graph inside the type checker.

At first, I didn’t understand what printing the type had to do with anything. It seemed like the sort of low-level implementation details that wouldn’t matter when discussing an algorithm’s performance. But it makes sense if you consider that an ML implementation will most likely print the complete type without taking shortcuts like the one above. I think that, the point of mentioning printing the type is to imply that while the type might be kept in memory in a more compact representation, if the implementation must print the whole type, it will take exponential time to do so.

pathological case, take 2

Aside from forcing the type to be printed, there is another way to get around this, and make it impossible to compactly represent the type. The programmer can force the type inference algorithm to generate unique type variables at each node in the tree.

First of all, let’s start by using let instead of double, to build the same binary trees as before. Here’s the tree with a depth of 3:

let
   val d1 = (3, 3)
   val d2 = (d1, d1)
   val d3 = (d2, d2)
in
   d3
end

Since the leaves of the tree are a monomorphic value (the type of 3 is of course int) let’s see what happens when we replace it with a polymorphic value:

fn _ => let
   val d1 = (id, id)
   val d2 = (d1, d1)
   val d3 = (d2, d2)
in
   d3
end

val it = fn
  : 'a
    -> ((('b -> 'b) * ('c -> 'c)) * (('d -> 'd) * ('e -> 'e))) *
       ((('f -> 'f) * ('g -> 'g)) * (('h -> 'h) * ('i -> 'i)))

(I’ve wrapped the let in a lambda to get around the value restriction, since the dummy type variables cause the type to be hard to read.)

Look at the type of d3. Compare it to the type when we repeatedly apply double to the identity function:

- double (double (double id)) ;

val it = fn
  : 'a
    -> ((('b -> 'b) * ('b -> 'b)) * (('b -> 'b) * ('b -> 'b))) *
       ((('b -> 'b) * ('b -> 'b)) * (('b -> 'b) * ('b -> 'b)))

When we use let, the resulting type has no shared structure, since each sub-tree has brand new type variables, there’s no way to define abbreviations that reduce the size to linear. So if we take this new pathological case and extrapolate, we start to get the enormous types promised to us. We don’t have to go far for things to get out of hand:

fn _ => let
   val d1 = (id, id)
   val d2 = (d1, d1)
   val d3 = (d2, d2)
   val d4 = (d3, d3)
   val d5 = (d4, d4)
   val d6 = (d5, d5)
in
   d6
end

val it = fn
  : 'a
    -> (((((('b -> 'b) * ('c -> 'c)) * (('d -> 'd) * ('e -> 'e))) *
          ((('f -> 'f) * ('g -> 'g)) * (('h -> 'h) * ('i -> 'i)))) *
         (((('j -> 'j) * ('k -> 'k)) * (('l -> 'l) * ('m -> 'm))) *
          ((('n -> 'n) * ('o -> 'o)) * (('p -> 'p) * ('q -> 'q))))) *
        ((((('r -> 'r) * ('s -> 's)) * (('t -> 't) * ('u -> 'u))) *
          ((('v -> 'v) * ('w -> 'w)) * (('x -> 'x) * ('y -> 'y)))) *
         (((('z -> 'z) * ('ba -> 'ba)) * (('bb -> 'bb) * ('bc -> 'bc))) *
          ((('bd -> 'bd) * ('be -> 'be)) * (('bf -> 'bf) * ('bg -> 'bg))))))
       *
       (((((('bh -> 'bh) * ('bi -> 'bi)) * (('bj -> 'bj) * ('bk -> 'bk))) *
          ((('bl -> 'bl) * ('bm -> 'bm)) * (('bn -> 'bn) * ('bo -> 'bo)))) *
         (((('bp -> 'bp) * ('bq -> 'bq)) * (('br -> 'br) * ('bs -> 'bs))) *
          ((('bt -> 'bt) * ('bu -> 'bu)) * (('bv -> 'bv) * ('bw -> 'bw))))) *
        ((((('bx -> 'bx) * ('by -> 'by)) * (('bz -> 'bz) * ('ca -> 'ca))) *
          ((('cb -> 'cb) * ('cc -> 'cc)) * (('cd -> 'cd) * ('ce -> 'ce)))) *
         (((('cf -> 'cf) * ('cg -> 'cg)) * (('ch -> 'ch) * ('ci -> 'ci))) *
          ((('cj -> 'cj) * ('ck -> 'ck)) * (('cl -> 'cl) * ('cm -> 'cm))))))

For even more spectacular types replace the pairs with triples!

computer SCIENCE

There are many other ways to construct expressions with these ridiculous types, and I experimented with a few different variations while preparing for the talk. I also tried compiling the different programs on a few different implementations of Hindley-Milner languages: Standard ML (via SML/NJ), Haskell (via GHC), and OCaml. By generating increasingly larger pathological inputs and timing how long it took each compiler to type check the programs, I hoped I could get some feeling for the time complexity as well as the size of the types.

This proved to be a little bit challenging because in some cases the time it took to type check the programs quickly grew to many hours, making it tough to gather data, especially since I was doing all of this only a few days prior to !!Con. The only reason I actually tested different compilers for different languages had nothing to do with some kind of language shootout but because some of them broke down at around n=2 or n=3, where n is the depth of nesting (this was for a different flavor of pathological program than the one above). In the end, I was able to get a satisfyingly exponential curve out of GHC:

SCIENCE

“How to compile Turing machines to ML types”

As I mentioned, the paper I got the most out of while preparing was “Deciding ML typability is complete for deterministic exponential time”. This is not because it explained things in a way that was necessarily easier for a lay-programmer to digest than the other papers, but because the proof at its core is so delightfully weird.

I am a complete novice when it comes to computational complexity theory, so I don’t actually know if this is an unusual for these kinds of proofs, but Mairson’s technique was very suprising to me. In order to prove that ML type checking is in the “DEXPTIME” class of problems, he embeds a Turing machine inside the ML type system and leverages the inner workings of Hindley-Milner type inference to advance the machine. When I realized what he how he was doing this (after a half-dozen or so re-reads) I was so stunned I nearly missed my stop on the subway.

I plan to read the paper again (and again, and again…) and try to really figure out the proof. It reminds me of when I first read about (in the Little Schemer) how to embed numbers, booleans and lists in the lambda calculus, in its sheer wonderful strangeness.

questions

Even apart from studying the proof technique in Mairson’s paper in depth, I’ve left many questions unanswered. Here’s a list of some of them.

Why can’t we just allow polymorphic lambda-bound variables? What does this have to do with different “ranks” of polymorphism?

How is let-polymorphism implemented? How do you implement it without just copying code around?

What’s the relationship between let enabling exponential function composition and the exponential time result? (I included this in my talk but cut it from this post because I wasn’t able to justify its status next to let-polymorphism. And yet, Mairson writes that “the inspiration is simply that the exp in exponential function composition is the same exp in DEXPTIME” so it’s clearly a crucial component.)

Do implementations of Hindley-Milner actually represent types as dags and utilize structural sharing? What does a linear-time implementation that lacks let look like, and how does it differ from the naive, non-linear implementation?

“So high, so low, so many things to know…”

References and further reading

Links

Stack Overflow: Very long type inference SML trick

CS Stack Exchange: Concise example of exponential cost of ML type inference

Papers and other published work

“Deciding ML typability is complete for deterministic exponential time” Harry G. Mairson 1990

“Polymorphic Type Inference” Michael I. Schwartzbach 1995

“Types and Programming Languages” Benjamin C. Pierce 2002

“Programming Languages: Application and Interpretation” Shriram Krishnamurthi (first ed.)

(The answer in that thread on CS Stack Exchange above links to a few more papers, but I haven’t read them yet.)

Learning ML

Robert Harper of CMU has written a very good and freely available introductory book on Standard ML. I’m also fond of ML for the Working Programmer, but it’s not freely available.

One of the !!Con organizers recently blogged about getting started in OCaml, which would be a better choice than Standard ML for more practical projects (i.e. projects that are not ML compilers, and maybe even those that are).

computational complexity theory

I’m sure I was briefly introduced to complexity classes and the P = NP problem in an undergraduate algorithms class, but that would have been almost 9 years ago. I tried to brush up before reading these papers and almost by accident hit upon the best explanation I’ve ever been exposed to, which I thought might be worth including in this list. Unsurprisingly, it’s from the excellent MIT OCW algorithms class.

Errors in the Stream

Feb 2nd, 2014

Or, handling syntax errors without exceptions

I didn’t always find parsing interesting. That might be a surprise judging from the past half dozen blog posts I’ve written, but I got interested in compilers and programming language implementation through various Lisp and Scheme books, like SICP and The Little Schemer, so I quickly adopted the “parsing is boring!” mantra that is espoused by some authors.

However, parsing is part of every compiler (even Lisp compilers), and syntax is part of every programming language, so I couldn’t ignore it forever. I’m enchanted with how programming languages work in part because it feels like peeling back the curtain and finding out that there is no magic, just code. I couldn’t treat parsers as a black box for very long.

Parsing is more than taking a stream of characters and producing an abstract syntax tree; that’s just the bare minimum. Parsers must be usable, by humans who are feeding them source code, and by backends of compilers and interpreters. This means proper error handling, and treating errors as first class citizens. There are a lot of other things that makes parsers interesting (well they ARE monads after all…), but error handling is a subtle art that is often swept under the rug.

exceptions in ML

Standard ML (and OCaml, and the “platonic ML”) has an exception mechanism which should be familiar to many programmers who have used almost any other language with exceptions. In some ways, exceptions are fundamental to ML, since there are certain obvious cases (taking the head of an empty list, dividing by zero, non-exhaustive matches) where the runtime must raise an error, and allowing the programmer handle them makes sense. Exceptions are interesting in a few different theoretical ways involving types, but they’re perfectly usable by the working programmer too.

Over use (or frankly even any use) of exceptions is somewhat bad practice in my humble opinion. For the same reason ML programmers would rather opt for the classic 'a option type over pervasive nullable pointers, it seems to me that embedding all possible values in the type of an expression and avoiding exceptions altogether would lead to more robust code. There are other uses of exceptions beyond error handling, for non-local control flow, but I doubt anyone would seriously advocate it’s good style. Like mutable state, I think using exceptions in ML programs is more of a patch over an incomplete design than a good practice. I’m sure there are exceptions to this rule, but I don’t think parsing is one of them.

exceptions in parsing

Let’s talk about a concrete example, the parser I plan to write for s-expressions. Without writing the implementation, here’s the type of a parser “factory”:

val makeParser : (token, 'a) reader -> (sexpr, 'a) reader = (* ... *)

Like the lexical analyzer factory, this function takes a token reader and returns a parser, which is a new reader that produces abstract syntax trees (sexprs) from the stream. Recall that readers return a value of type ('a, 'b) option, in other words either a SOME (element, stream) or NONE. In the case of I/O or general stream processing code, the NONE value is simple: it just signifies that the stream is empty or the file handle has reached the EOF. (If this paragraph was incomprehensible to you, start here.)

As for the parser, what should happen in the case of a syntax error? Success is easy enough to understand, the result will be SOME (ast, stream) but failure is more complicated than just NONE. Reaching the end of the stream isn’t an exceptional circumstance, and shouldn’t represent an error (reaching the end EARLY is an error, but that just underlines my point, that NONE clobbers any granularity we might want). Ideally the parser would report something more interesting than just the existence of an error, but also an error message and where it occurred. This is necessary to achieve my goal of:

Syntax error on line 1 col 20:
(let ((x 1) (y 2)) (
~~~~~~~~~~~~~~~~~~~~^ unexpected EOF

What if we tried to use exceptions anyway? You could try, knowing it was a crutch and only a stepping stone to refactoring the code into a more complete, elegant design. But you run into an interesting problem pretty quickly. Look at the type of the parser factory above, polymorphic in the type of the stream. If we want to throw an exception that includes the stream itself, so that we can print an error message pointing at concrete syntax like the one above, the exception would have to be polymorphic too. But that type variable is scoped to the function, and Standard ML doesn’t allow you to define a polymorphic exception at the top-level.

specializing `either`

There is an alternative to 'a option which isn’t part of the Standard ML basis but is easy to define and should be familiar to Haskell programmers: ('a, 'b) either.

type ('a, 'b) either = Left of 'a | Right of 'b

This is basically a generalized union for types. Given two types 'a and 'b, either constructs a type consisting of their union or sum. It’s useful when you want to return an optional value but want the NONE variant to carry some extra data. You can imagine instead of Left and Right using:

type ('a, 'b) either = Success of 'a | Error of 'b

In our case, this is close but not quite enough. There are actually two possibilities other than success when trying to parse an ast from a stream: errors but also simply reaching the end of the stream. Reaching the end of a stream isn’t actually a error condition: it just means I’ve finished parsing a file or some other I/O stream (like a REPL being terminated with `). So it turns out that the following ternary type is better suited to parsing:

type ('a, 'b) result = Success of 'a * 'b | Error of string * 'b | EOF

I’ve further specified the type to return a pair of resulting element and rest of stream in the case of success, and return a pair of error message and stream in the case of errors. Here are a few example values of this type for clarification:

val Success (ASTNode (x, y, z), stream') = parse stream
val Error ("Expected END but got IF", stream') = parse stream

And here’s a little example function that consumes values of this type:

fun repl s =
    case parse s of
        Success (ast, s') => (print (eval ast)
                            ; repl s')
      | Error (msg, s')   => print (error (msg, s'))
      | EOF               => ()

an error-reporting parser for s-expressions

Our parser will take a stream and return an ML datatype representing s-expressions. This type is a simplified version of s-expressions that does not include “dotted pairs” because dotted pairs complicates the parser quite a bit without making it much more interesting.

datatype 'a sexpr = Atom of 'a * string | List of 'a * 'a sexpr list

The sexpr type is polymorphic leaving room for any data to be attached to each and every node in the abstract syntax tree. In our case, this will simply be line and column number (Pos.t) but you can imagine in a real compiler you might have more interesting data to attach, like scope or type information.

Finally, here is the complete source for the parser factory function, which takes a lexer and returns a parser:

fun makeParser (rdr : (Lexer.token * Pos.t, 'a * Pos.t) reader) : (Pos.t sexpr, 'a * Pos.t) parser =
    let
       fun sexpr s =
            case rdr s of
                SOME ((Lexer.Atom a, p), s') => Success (Atom (p, a), s')
              | SOME ((Lexer.LParen, p), s') => sexprList p s' []
              | SOME ((Lexer.RParen, _), _)  => Error ("unexpected )", s)
              | NONE => EOF

       and sexprList p s acc =
            case rdr s of
                NONE                         => Error ("unexpected EOF", s)
              | SOME ((Lexer.RParen, _), s') => Success (List (p, rev acc), s')
              | SOME _                       =>
                case sexpr s of
                    Success (x, s') => sexprList p s' (x :: acc)
                  | other           => other
    in
       sexpr
    end

This function follows from the s-expression grammar:

sexpr     -> <atom>
          -> ( sexprList )
sexprList -> .
          -> sexpr sexprList

The most interesting bit of this is probably the recursive call to sexpr on line 15. Here, I recursively parse an s-expression appearing inside a list and unpack the result only if it was successfully parsed, otherwise I return the result which is likely an error; I don’t believe that this call could ever produce the EOF value but I haven’t actually proved it or tested the code heavily to be sure.

local
   open Parser
in
   val Success (List ({line=1, col=0},
                      [Atom ({line=1, col=1}, "foo"),
                       List ({line=2, col=0},
                             [Atom ({line=2, col=1}, "bar")])]),
                _) =
       parse (Pos.stream "(foo\n(bar))")
end

Unfortunately, s-expressions are so simple there are only really two interesting errors: unexpected end of streams, and unexpected closing parentheses.

- parse (Pos.stream "(foo (bar") ;
val it = Error ("unexpected EOF",("",{col=9,line=1}))
  : (Pos.t Parser.sexpr,string * Pos.t) Parser.result
- parse (Pos.stream ")") ;
val it = Error ("unexpected )",(")",{col=0,line=1}))
  : (Pos.t Parser.sexpr,string * Pos.t) Parser.result

However, since every node in the abstract syntax tree is annotated with a position in concrete syntax, an interpreter or compiler that consumed the tree could use that position for better runtime error reporting. Hopefully you can also imagine a type checker doing the same for type errors.

Printing a nice error message with a little caret to the offending character requires a little extra work. When the error is discovered the parser has a reference to the rest of the stream, but this corresponds to the input source just after the error and beyon. The reader/stream abstraction doesn’t include any way to backtrack. Fortunately it’s not hard to remedy this. In most cases printing the line number, then the full line in question, and then the error message and column number would suffice to report the syntax error to the programmer. So the only problem is getting back to the start of the offending. It’s enough to just keep the stream at the beginning of the current line around until the reader reaches a new line:

fun reader rdr =
    fn (s, t, p) =>
       case rdr t of
           NONE            => NONE
         | SOME (#"\n", u) => SOME ((#"\n", p), (u, u, incrLine p))
         | SOME (x,     u) => SOME ((x,     p), (s, u, incrCol  p))

This is a function that takes a character reader (rdr) and returns a new reader that operates on a “richer” positional stream, one that also includes the stream state at the start of the current line. The reader takes a triple of s, t, p, where s is the stream at the start of the line, t is the current stream in the middle of the line, and p is the position. Printing the error message is then a question of taking the stream at the start of the line, consuming the entire line from that stream, and formatting the output nicely. This is left as an exercise for the reader. ;)

pretty-printing ASTs

It’s worth noting that some programming language implementations take a pretty drastically different approach to this entire problem: pretty printing abstract syntax trees. Particularly relevant is SML/NJ, which uses this when printing type errors:

val xs = [1,2,3]

val _ = case List.getItem xs of
            SOME (y, ys) => y
          | NONE         => xs

This is a simple example (note that the case expression’s branches are of different type) that yields the following error:

3.9-5.31 Error: case object and rules don't agree [tycon mismatch]
  rule domain: (int list * 'Z) option
  object: (int * int list) option
  in expression:
    (case (List.getItem xs)
      of SOME (y,ys) => y
       | NONE => xs)

The concrete syntax appearing after “in expression:” has been reconstructed and pretty-printed (I think) from the abstract syntax tree, and differs from the original in whitespace and parentheses. This occasionally causes me some confusion when debugging really hairy type errors, with more complicated expressions. Interestingly, the position (“3.9-5.31” which is the start and end of the expression, in the form “line.column-line.column”) DOES match up with the original input. So SML/NJ must keep that info around but doesn’t use it to go back to the original input when rendering the problematic bit of code in the error message.

Positional Lexers

Jan 26th, 2014

… in ML, of course

In my last post, for some reason I decided to drop “in ML” from the title, unlike previous posts. I’m not sure why I did that, or whether I like it, since a lot of these blog posts and accompanying code is kind of ML specific (and I am awfully fond of ML), but at the same time I’d like to think these posts are more generally useful to non-ML programmers interested in amateur compiler hacking. For now, I guess “in ML” will just be implicit, until I start posting about work in another language.

positional lexical analyzers

Moving on, I wanted to address something I neglected to cover in my last post, which is the process of turning a positional character reader into a lexical analyzer that produces a stream of tokens with their position in file. This is the second prerequisite after positional character readers to actually handling and reporting syntactic errors (and other errors, both static and dynamic).

what’s the point of all this

I’ve been steaming along with these compiler front-end posts without a lot of higher level commentary, for instance never addressing why I might be bothering with any of this stuff, or where the hell I’m going with it all. So here’s a brief rundown of my motivations and where I want to go in the next half dozen or so blog posts.

Last summer (while on a mini-sabattical at Hacker School) I wrote most of a front end for an ML compiler, nicknamed maml. It consists of a lexer, parser, type checker (with Hindley-Milner type inference), and a naive pattern match compiler. Eventually, I hope it will become a full blown implementation of a mini ML-like language, not all the way to a real Standard ML but a bit more than a toy interpreter. One of my goals for this project was always to explore aspects of compiler construction I’ve avoided, and one of those that I wanted to understand better was parsing.

The parser in maml is a hand-written recursive descent parser (with an embedded Pratt-style expression parser, which deserves its own post to be honest). Right now the parser does a poor job reporting syntax errors, and the same goes for the type checker reporting type errors (and for that matter the pattern match compiler reporting non-exhaustive match warnings). Exploring how to do a good job of handling and reporting errors is where I am right now. Part of that involved refactoring the front end to deal in streams instead of the naive approach of handling lists, and so that’s why I’ve focused on streams in these posts so much. Handling and reporting errors during parsing will be the next post I write, after this one. Eventually I’ll find the time to return to maml and impart what I’ve found into a refactor of the parser, and eventually the type checker.

To put it succinctly, this is what I want my parser to do when encountering bad input:

Syntax error:
val x = if y then z
~~~~~~~~~~~~~~~~~~~~^ expected 'else'

back to the lexer

Ok, with that out of the way, I hope that the thread is clearer and that the end goals better defined. Recall that previously we had a character reader (refer back to this post for background on readers) that produced characters with attached column and line numbers:

val reader: (char * Pos.t,'a * Pos.t) StringCvt.reader = (* ... *)

Where the type Pos.t is:

type t = {col: int, line: int}

The Pos module defines a small suite of related functions for operating on values of this position type. The plan is to write a function that, given a positional character reader, returns a token reader that attaches line and column numbers to each token in the stream. For a suitable token type Lexer.t, we want:

val makeLexer: (char * Pos.t,'a * Pos.t) StringCvt.reader -> (Lexer.t * Pos.t,'a * Pos.t) = (* ... *)

That is to say, the function returned produces positional tokens from a positional stream of type 'a.

defining a token type

Since it’s a nice small example, we’ll steal the Scheme token type I used in a previous post on lexers and streams. Here it is again:

datatype token = LParen | RParen | Atom of string

I’ve removed the Dot constructor, for reasons that may become clear in a future post. For now trust me that it turns out to unnecessarily complicate things in parsing, and I want to use this example as the foundation for posts on parsing.

Specific values of this type will need to carry additional data: line and column numbers, or positions. There are two ways you could do this. One way would be for your lexer to produce pairs of token and position, rather than just tokens. The other is to modify this type so that each constructor accepts another argument, the position. During lexing, this matters less, but it turns out that in parsing the second is much easier, I think. This problem in general isn’t entirely trivial.

Since we have existing code that treats positional readers as those that produce pairs of element and position, for the lexer we’ll go with this. The parser will likely use the other approach.

the actual lexer

The makeLexer function isn’t that difficult to write. It doesn’t differ much from the previous lexer (although keep in mind we aren’t dealing with dotted pairs any more), and the biggest change is really just the bookkeeping of using the right position at the time of token construction:

fun makeLexer (rdr : (char * Pos.t, 'a) reader) : (token * Pos.t, 'a) reader =
    let
       fun skipWS rdr s =
           case rdr s of
               NONE => s
             | SOME ((#" ", _), s') => skipWS rdr s'
             | SOME ((_   , _), s') => s
    in
       fn s =>
          case rdr (skipWS rdr s) of
              NONE => NONE
            | SOME ((#"(", p), s') => SOME ((LParen, p), s')
            | SOME ((#")", p), s') => SOME ((RParen, p), s')
            | SOME ((_, p), s') =>
              case getAtom (rdr, skipWS rdr s) of
                  NONE => NONE
                | SOME (atom, s') => SOME ((Atom atom, p), s')
    end

The function getAtom that extracts an atom as a string from the stream is largely unchanged, but there is again some extra bookkeeping, in this case updating the pattern matching to just ignore the position that’s attached to every character.

fun getAtom (rdr : ((char * Pos.t), 'b) reader, s : 'b) : (string * 'b) option =
    let
       fun done [] _ = NONE
         | done acc s = SOME (String.implode (rev acc), s)

       fun getAtom' acc s =
           case rdr s of
                NONE => done acc s
              | SOME ((#"(", _), rest) => done acc s
              | SOME ((#")", _), rest) => done acc s
              | SOME ((x,    _), rest) =>
                if Char.isSpace x then
                   done acc s
                else getAtom' (x :: acc) rest
    in
       getAtom' [] s
    end

Along with the supporing infrastructure, we can now use this code to tokenize any stream (with a suitable reader implementation) into tokens with character position attached to them. For instance, given a regular character reader that operates on strings (this is inefficient, but fine for our purposes):

fun getc1 "" = NONE
  | getc1 s  = SOME (String.sub (s, 0), String.substring (s, 1, size s - 1))

val SOME (#"f", "oo") = getc1 "foo"

And a positional reader constructed from our string reader:

val getc2 = Pos.reader getc1

val SOME ((#"f", p), ("oo", p')) = getc2 (Pos.stream "foo")

val 1 = Pos.line p
val 0 = Pos.col p

val 1 = Pos.line p'
val 1 = Pos.col p'

We can finally construct a lexer, that attaches line and column numbers to tokens that it produces. This is then used by the parser to report errors, as well as attach the position of various syntax elements to the abstract syntax tree. I’ll try to cover that in my next post.

val lex   = Lexer.makeLexer getc2

val SOME ((LParen, p), ("foo)", p')) = lex (Pos.stream "(foo)")
val 1 = Pos.line p
val 0 = Pos.col p

Positional Streams

Jan 19th, 2014

Previously, I hinted that attaching line and column numbers to a token stream would be a challenge. While it didn’t actually take the four months since I wrote that for me to figure out how to do it, getting it to work without forcing a rewrite of all my stream processing code was tricky. Inevitably, the result is a simple composition of higher order functions, but that belies how much of a pain it was to iron out.

Recall that the primary method of interacting with streams is via readers, which are simply peek functions with a fancy name:

type ('a, 'b) reader = 'b -> ('a * 'b) option

positions

The first thing we need in order to start tracking line and column numbers is some representation of a position in a file. While a pair would suffice, I think it’s a little clearer to use a record with line and col fields, but then wrap that in a module with an abstract type (i.e. with opaque signature ascription). This reduces the clutter when printing streams and elements at the REPL along with attached positions. There are of course many great reasons to use abstract types: modularity, encapsulation, etc… but I’m not that embarrassed to admit that getting the concise printed representation - is half of why I use them most of the time.

type t = {col: int, line: int}

There are a few helpers that will make our lives a bit easier, and also allow consumers of this type to actually do anything useful with it, like print an error message specifying line and column:

val zero = {col = 0, line = 1}

fun col      {col, line} = col
fun line     {col, line} = line

fun incrCol  {col, line} = {col = col + 1, line = line}
fun incrLine {col, line} = {col = 0,       line = line + 1}

fun show {col, line} = Int.toString line ^ ":" ^ Int.toString col

Some of these are standard for dealing with abstract types: you need to provide all of the ways a user would construct and manipulate values of this type. In this case we have a zero value indicating the beginning of a stream, selectors (col, line) to get at the fields, applicative updaters (incrCol, incrLine) and a show function. Bog standard stuff, but note along with this that using abstract types rather than “raw” tuples would let us extend this to include a third field like absolute character position without updating as much code.

positional streams

Finally, the more interesting bits are how we deal with streams and readers. If a position is a value of this new type, what’s the type of a positional stream, one that keeps track of line and column number? And how to implement a reader to go along with it? I settled on a simple, non-opaque representation: a pair of stream and position representing a positional stream, and a pair of element and position for the elements generated by the stream. This code sort of writes itself but there is some subtlety to the reader.

fun stream s = (s, zero)

fun reader rdr =
    fn (s, p) =>
       case rdr s of
           NONE            => NONE
         | SOME (#"\n", t) => SOME ((#"\n", p), (t, incrLine p))
         | SOME (x,     t) => SOME ((x,     p), (t, incrCol  p))

This function takes a character reader and returns a new reader that works the same way as the first except that it also attaches positions to each character and to the stream. Note that the new reader is responsible for updating the position in the stream, and so must unpack the value produced by the first reader and increment the line number if it’s a new line, otherwise update the column number.

tricky bits

So, now we have a way of constructing positional readers from raw character readers. So far, so good. Things break down a little when we attempt to use our growing body of general stream processing code. Take a look at the type of skipWS, our example from earlier that strips whitespace off the front of a stream:

val skipWS : (char, 'a) reader -> 'a -> 'a

This function explicitly takes a raw character reader, and does then its work on the stream. But there’s no way we can use this with a positional reader and positional streams, since the type of element produced by a positional reader is not char but char * Pos.t. (where Pos.t is the type of position, see the full source for the Pos module)

I struggled with this problem for a little while. It seemed at first like there was no easy solution that would make it so you didn’t have to have a whole seperate suite of “general” functions that operated only on positional streams. After a while though, I realized that the solution was relatively simple, I had just been looking at it the wrong way. Rather than try to write a generic skipWS function that operated on both raw and positional streams, I should have been trying to find a way to parameterize the skipWS function in terms of the type of stream. This is something I’m already doing in a lot of stream processing code, since almost any function operating on streams needs access to the appropriate reader for that type of stream.

Take a look at the classic function dropWhile, re-written to work with streams:

fun dropWhile rdr f s =
    case rdr s of
        SOME (x, s') => if f x then dropWhile rdr f s' else s
      | NONE => s

This function depends on access to a rdr which decomposes the stream to it can be filtered and truncated. Here’s skipWS written in terms of dropWhile:

fun skipWS rdr = dropWhile rdr Char.isSpace

The secret was to realize that Char.isSpace is the culprit. It operates on raw characters, when we need it to operate on characters plus positions. This should really be a parameter to skipWS, letting the user decide what kind of stream they are stripping whitespace from. One could even imagine a situation where they weren’t operating on characters at all, but some other type with its own notion of what whitespace means.

At this point, the parameterization of these functions gets out of hand:

fun skipWS rdr isSpace = dropWhile rdr isSpace

Personally, I think two arguments like this is the point where I reach for functors:

signature READER =
sig
   type t
   type s
   val isSpace: t -> bool
   val peek: (t, s) StringCvt.reader
   (* ... *)
end

functor StreamFn(structure Rdr: READER) = struct
  fun dropWhile f s =
      case Rdr.peek s of
          SOME (x, s') => if f x then
                             dropWhile f s'
                          else s
        | NONE => s

  val skipWS = dropWhile Rdr.isSpace
end

structure SubstringReader: READER =
struct
   type t = char
   type s = Substring.substring
   val isSpace = Char.isSpace
   val peek = Substring.getc
end

structure SubstringStream = StreamFn(structure Rdr = SubstringReader)

I’m not too sure about the names of all these modules, but this is at least a start. I’m planning to experiment with this approach while writing a little parser that reports errors in as nice a way as possible, and continue experimenting with this as I begin to integrate something hopefully a little smoother around the edges into my ML compiler, for eventual use in reporting not only syntax errors but also type errors.

Lexical analysis with readers and streams in ML

Sep 9th, 2013

In my last post I described an interesting abstraction for streams in Standard ML, that I “discovered” after some trial and error. I say “discover” with scare quotes because I didn’t exactly make a revolutionary contribution; the documentation in the Basis describes this idea in detail and presents a few examples (see “Discussion” here). But it surprised me when I learned about it, so I thought it would be interesting to other people. Assuming that it is, I thought it might be helpful to walk through the small example I’ve been using as a proof of concept while experimenting with this code.

Before publishing the previous post, I wrote a lexer and a parser for s-expressions, using readers and streams. I chose this because it’s a very small language that’s easy to parse with recursive descent, but it’s also complicated enough that the code illustrates some of the issues you’d encounter writing larger parsers.

crash course on s-expressions

To refresh your memory (or introduce you if you haven’t used Lisp or Scheme) s-expressions are the textual representation of Lisp and Scheme source code and list literals. The subtle relationship between code and data is where Lisp draws much of its metaprogramming power, but that’s quite far from the scope of this post. If you’re not familiar with them, it’s safe to think of s-expressions as a lightweight notation for nested lists:

(1 2 (3 4) 5 (6 (7) 8) 9 0)

Just keep in mind that they can also be used to represent Lisp and Scheme code:

some Scheme code

(define (fact n)
  (if (= n 0) 1
      (* n (fact (- n 1)))))

To make it a bit more interesting, I decided to implement another feature of s-expressions: “dotted pairs” or “improper lists”. In Lisp, lists are constructed using the familar cons function, which can be used the same way it is in ML. In ML the type of cons (i.e. ::) is 'a * 'a list -> 'a list, and the compiler checks this for us. In Lisp, without static types, cons can be used the same way, and usually is: the first argument is something atomic, and the second argument is a list. The empty list or nil denotes the end of the list, just as in ML. But it’s possible to use it to construct arbitrary pairs, and for the second element to be atomic too, in which case that pair is printed with a dot:

> (cons 1 (cons 2 nil))
(1 2)
> (cons 1 2)
(1 . 2)
> (cons 1 (cons 2 (cons 3 4)))
(1 2 3 . 4)

reader

Recall that a reader is a peek function over a stream; it takes a stream and returns an element and new stream (wrapped in 'a option). Concretely, the type of a reader, as defined in StringCvt is:

type ('a,'b) reader = 'b -> ('a * 'b) option

…where 'b is the type of stream and 'a is the type of element in the stream. For instance, the type of reader for characters and strings is string -> (char * string) option.

tokens

Lexical analysis is the process of breaking input into words or tokens. It can also be thought of as assembling a higher level object (tokens) from a lower level one (characters). Lexers usually also remove comments and normalize whitespace, as appropriate for the programming language in question.

Our lexer will be a function named tokenize that operates on readers. Its argument is a character reader and it will return a token reader. The argument produces characters from some type of stream, and the returned reader will produce tokens from the same type of stream:

tokenize_type

val tokenize : (char, 'a) StringCvt.reader -> (token, 'a) StringCvt.reader

There are only four kinds of tokens in the little s-expression language: left (opening) parentheses, right (closing) parentheses, atoms, and a period (for dotted pairs and improper lists).

token_datatype

datatype token = LParen | RParen | Atom of string | Dot

Three of the tokens are single characters, while atoms are the only token longer than one character. The reader returned by tokenize will need to peek at least one character ahead in the stream to determine what to do. If that character is one of the single character tokens, all it needs to do is return that token along with the rest of the stream. If the character is some other non-whitespace character, it should read ahead until seeing some character not part of the atom, i.e. one of the other tokens or whitespace. Finally, it should just skip over the whitespace between tokens.

whitespace

Stripping whitespace from streams, with readers, is something that nearly every character and string processing program will want to do, so there’s a function already provided in the StringCvt module for just this purpose: skipWS. It was helpful for me to understand how to implement this myself, just to get a better intuition for streams and readers:

fun skipWS rdr s =
    case rdr s of
        NONE => s
      | SOME (x, s') =>
        if Char.isSpace x then
           skipWS rdr s'
        else s

This function is an example of a more general pattern for readers and streams: it’s primarily a function on streams, but since it needs a particular reader to work with any particular stream, it takes a reader as the first argument, a stream as the second argument, and it returns a stream. It discards characters from the stream with the provided reader until it sees one that is not whitespace, returning the stream at that point. For example:

- skipWS Reader.string "   foo" ;
val it = "foo" : string

Here, Reader.string is a string reader, i.e. a value of type (char,string) StringCvt.reader. The function skipWS will work with any type of character stream, whether backed by strings, substrings, lists, or files:

- skipWS Reader.list [#" ",#" ",#" ",#"f",#"o",#"o"] ;
val it = [#"f",#"o",#"o"] : char list
- Substring.string (skipWS Reader.substring (Substring.full "   foo")) ;
val it = "foo" : string

- let val out = TextIO.openOut "scratch"
      val _ = TextIO.output (out, "   foo")
      val _ = TextIO.flushOut out
      val fromFile : string -> TextIO.StreamIO.instream =
          TextIO.getInstream o TextIO.openIn
      val inp = fromFile "scratch"
in
   TextIO.StreamIO.inputAll (skipWS Reader.streamIO inp)
end ;

val it = ("foo",-) : TextIO.StreamIO.vector * ?.TextIO.instream

Again, Reader.list, Reader.substring, Reader.streamIO are all just readers for various concrete stream types. You can see all of the Reader struct in the source for this post. I’m tempted to build up a small structure with general reader-based stream processing code in the vein ofskipWS. But the implementations of the readers themselves are actually scattered throughout the Basis, for instance Substring.getc is actually a reader. If you’re just messing around, you can use those.

tokenize

With this under our belt, we can move on to writing the lexer proper:

tokenize_impl

fun tokenize rdr =
    fn s =>
       case rdr (StringCvt.skipWS rdr s) of
           NONE => NONE
         | SOME (#".", s') => SOME (Dot, s')
         | SOME (#"(", s') => SOME (LParen, s')
         | SOME (#")", s') => SOME (RParen, s')
         | SOME (_, s') =>
           case getAtom rdr (StringCvt.skipWS rdr s) of
               NONE => NONE
             | SOME (atom, s') => SOME (Atom atom, s')

Remember that this function takes a reader, and returns another reader. Since readers are functions that operate on streams, tokenize returns another function, so the body is just a lambda expression (fn s => on line 2).

The first thing we do is apply the skip whitespace function from StringCvt to the stream, then apply the reader (rdr (StringCvt.skipWS rdr s) on line 3). This gives us either NONE, indicating the stream is empty, or SOME with the peeked character and the rest of the stream. Note that since we remove the leading whitespace, we know this character is not a space. We either return a single character token or call a helper function named getAtom, which takes a reader and a character stream, and returns the next atom in the stream as a string, and the rest of the character stream.

This is another example of a common pattern, in this case for stream processing functions that take a reader and return another reader. The structure of the branches of the case expression is the important takeaway. If we try to peek into the first element of the stream and find nothing (i.e. NONE) then we should just return NONE, because the stream is empty. If there is an element in the stream, we transform that element somehow (in this case converting it to a token), and return the result of the transformation and the rest of the stream, wrapped in SOME.

getAtom_impl

fun getAtom rdr s =
    let
       fun return [] _ = NONE
         | return acc s = SOME (String.implode (rev acc), s)

       fun getAtom' acc s =
           case rdr s of
                NONE => return acc s
              | SOME (#"(", rest) => return acc s
              | SOME (#")", rest) => return acc s
              | SOME (x, rest) => if Char.isSpace x then
                                     return acc s
                                  else getAtom' (x :: acc) rest
    in
       getAtom' [] s
    end

The function getAtom reads characters from the stream until it sees either a parentheses or a space. This is closer to the skipWS function than to tokenize, in terms of the shape of the code. Instead of discarding characters as skipWS does, it collects them and returns them as a string, along with the rest of the stream.

Together, these functions complete the lexical analyzer. Using a little helper function consume that repeatedly peeks from a stream and accumulates the elements into a list, we can run some ad-hoc tests:

tests

val [LParen, Atom "foo", RParen] = consume (tokenize string) "(foo)"
val [LParen, Atom "foo", Atom "bar", RParen] = consume (tokenize string) "(foo bar)"
val [LParen, Atom "foo", Dot, Atom "bar", RParen] = consume (tokenize string) "(foo . bar)"

That concludes this walkthrough. I hope it’s helpful, and demystifies lexical analysis a little bit (convincing you that you need neither ANTLR nor the Dragon book to write a little lexer) and also elaborates on the reader abstraction and how to make use of it. The code in this post, along with supporting boilerplate, is on Github, both as an Org-mode/Babel file and a tangled code-only .sml source file. If this whets your appetite, you could try implementing comments, which should be straightforward, or attach line and column number to the tokens, which is much harder. I may cover those in a later post, and later still cover the parser.

Polymorphic streams in ML

Aug 31st, 2013

For a while now, I’ve been meaning to refactor my ML compiler’s front-end in order to support reading source code from files instead of only from strings in memory. Right now, the type of the lexical analyzer is string -> token list (a function from strings to lists of tokens) and I hoped it would be possible to change this to char stream -> token stream (a function from streams of characters to streams of tokens), where the 'a stream type would be some abstraction over sequences, such as strings but also file handles. Then the lexer could ignore the underlying details of the input, treating string input (for example in unit tests) the same way as file I/O (for example when compiling user programs). If they could both use the same type of stream (i.e. both instances of some type 'a stream), that would be sort of nice and elegant, plus allow the parser (which would consume a token stream) to share any general stream processing code I wrote for the lexer. So I started to look for a stream library for Standard ML.

Functional I/O

The Standard ML Basis has a functional I/O library based around streams. This module, TextIO.StreamIO, has a signature that should be quite natural for most functional programmers to use, and supports the kind of higher-order functions you might be used to using on lists, for instance takeWhile and filter, both common somewhat operations on input, which can be easily implemented to work with the StreamIO interface:

(* Construct I/O streams from files ... *)
val fromFile : string -> TextIO.StreamIO.instream =
    TextIO.getInstream o TextIO.openIn

(* ... and from strings *)
val fromStr : string -> TextIO.StreamIO.instream =
    TextIO.getInstream o TextIO.openString

 local
    structure S = TextIO.StreamIO
 in

 (*
  * Reads chars from the stream s while p x is true
  *)
 fun takeWhile p s =
     let
        fun takeWhile' acc s =
            case S.input1 s of
                NONE => (String.implode (rev acc), s)
              | SOME (x, s') =>
                if p x then
                   takeWhile' (x :: acc) s'
                else (String.implode (rev acc), s)
     in
        takeWhile' [] s
     end

 (*
  * Filters stream s by predicate p
  *)
 fun filter p s =
     let
        fun filter' acc s =
            case S.input1 s of
                NONE => (String.implode (rev acc), s)
              | SOME (x, s') =>
                if p x then
                   filter' (x :: acc) s'
                else filter' acc s'
     in
        filter' [] s
     end
 end

 val s = fromStr "fooBar"
 val ("foo", s') = takeWhile Char.isLower s

 val t = fromStr "aBcDeF"
 val ("ace", t') = filter Char.isLower t

There are several problems with this code. First of all filter should really return another stream, but in this case it returns consumes all the input from the stream and returns it as a string. This could be pretty bad if the file was large, but even leaving issues of efficiency aside, it’s simply a bad interface. Ideally it should be possible to apply transformations like filter to streams and pass the resulting streams to other functions that can consume the streams, oblivious to the filtering.

Furthermore, we’re restricted to streams of characters, so if I used this for the input stream, the lexer would have to return some other stream type, not a stream based on the TextIO.StreamIO module. It may be possible to construct, from lower level pieces of the primitive I/O modules, a value of type TextIO.StreamIO.instream that consists of elements that are not characters, however after reading through the documentation and encountering this monster record, I realized this module is very I/O focused and while it might be possible it would be really hairy. Not yet discouraged, I decided to explore other options for a custom stream type that was polymorphic in its element, but still relatively simple.

Naive streams

Unfortunately, after some experimentation, I realized this is a more challenging task than I initially thought. While it certainly may be possible with some type gymnastics (I’m generally pretty conservative when claiming something is impossible in ML, after seeing the kinds of type hackery exhibited on the MLton wiki) it isn’t clear to me how to structure the code such that it passes the type checker. The problem seems to be that because I/O streams and strings would always produce values with concrete types (characters or bytes), there’s an inherent disagreement about the type of functions that operate on streams. For example:

(* This code does not compile *)
signature STREAM =
sig
   type 'a t
   val fromList : 'a list -> 'a t
   val fromIO : TextIO.StreamIO.instream -> char t
   val fromString : string -> char t
   val peek : 'a t -> ('a * 'a t) option
end

structure Stream : STREAM =
struct
   datatype 'a t = String of int * string
                 | List of 'a list
                 | IO of TextIO.StreamIO.instream
   val fromList = List
   val fromIO = IO
   fun fromString s = String (0, s)
   fun peek (String (idx, str)) =
       if idx < (String.size str) then
          SOME (String.sub (str, idx), String (idx + 1, str))
       else NONE
     | peek (List []) = NONE
     | peek (List (x::xs)) = SOME (x, List xs)
     | peek (IO ins) =
       case TextIO.StreamIO.input1 ins of
           NONE => NONE
         | SOME (x, ins') => SOME (x, IO ins')
end

stdIn:19.1-37.4 Error: value type in structure doesn't match signature spec
    name: peek
  spec:   'a ?.Stream.t -> ('a * 'a ?.Stream.t) option
  actual: char ?.Stream.t -> (char * char ?.Stream.t) option

The type of Stream.peek can’t be polymorphic since for some values it always returns characters. There are other, related problems with this code, specifically values constructed with Stream.IO and Stream.String are polymorphic, which becomes complicated by the value restriction.

The STREAM signature is inspired by a stream module from the OCaml standard library. Obviously, this module does type check in OCaml. How the OCaml version works is not clear to me, but after discovering multiple calls to Obj.magic (whose type is 'a -> 'b) in the OCaml implementation, specficially applied to the value read from the actual I/O stream, I realized that there might be a hole punched in the type system for this to work, and I clearly don’t know enough OCaml to understand it.

Reader

At some point in my search, I had a breakthrough when I stumbled onto the StringCvt module in the Standard ML Basis. It’s an innocuously named module related to converting strings into values, used for instance by Int.fmt and Int.scan. However, what is lexical analysis and parsing but converting strings (or char streams) into values? It turns out that there is an extremely simple type alias in this module that is a very solid foundation for a larger body of stream processing code: the lowly ('a, 'b) reader. This is just a name for the type 'b -> ('a * 'b) option, defined in code very simply as just:

type ('a,'b) reader = 'b -> ('a * 'b) option

From the description: “The type of a reader producing values of type ‘a from a stream of type ‘b. A return value of SOME(a,b) corresponds to a value a scanned from the stream, plus the remainder b of the stream. A return value of NONE indicates that no value of the correct type could be scanned from the prefix of the stream.”

This should be quite clear, but let me reiterate: A value of type reader is not a stream, but a function of one argument, which is a stream. If there are no more elements left in the stream, the return value is NONE. If there are elements, the return value is SOME (x, s'), where x is the next element in the stream, and s' is the rest of the stream after consuming x.

The insight is that this allows you to more easily compose versions of higher-order functions like map, filter and friends that operate on streams.

structure Reader =
struct

local
   open String
   open StringCvt
in

val list : ('a, 'a list) reader =
 fn [] => NONE
  | (x::xs) => SOME (x, xs)

val string : (char, string) reader =
 fn "" => NONE
  | s => SOME (sub (s, 0), substring (s, 1, size s - 1))


val instream : (char, TextIO.StreamIO.instream) reader =
    TextIO.StreamIO.input1

fun map (f : 'a -> 'c) (rdr : ('a, 'b) reader) : ('c, 'b) reader =
    fn s =>
       case rdr s of
           NONE => NONE
         | SOME (x, s') => SOME (f x, s')

fun filter (p : 'a -> bool) (rdr : ('a, 'b) reader) : ('a, 'b) reader =
    let
       fun rdr' s =
           case rdr s of
               NONE => NONE
             | SOME (x, s') =>
               if p x then
                  SOME (x, s')
               else rdr' s'
    in
       rdr'
    end
end

end

The type of map for streams is slightly different than the usual map over lists. Both of them take a transformation function with the type 'a -> 'b, transforming values of type 'a to 'b. The regular map over lists then takes a list of 'a and returns a list of 'b. You might think that map over streams would take a stream of 'a and return a stream of 'b but with the reader abstraction the type is a little bit different: instead it takes a reader producing values of type 'a from a stream of type 'c and returns a reader that produces values of type 'b from a stream of type 'c. So it’s a function that operates on readers rather than streams, but this turns out to work quite well, I think. Filter works the same way.

val toUpper = Reader.map Char.toUpper Reader.string
val SOME (#"F", "oo") = toUpper "foo"

val upperOnly = Reader.filter Char.isUpper Reader.string
val SOME (#"B", "ar") = upperOnly "fooBar"

What’s next

I’ve started to write a few small lexers and parsers using readers, and I think that it’s a nice way to abstract over the possible input values that string and character processing programs consume. Of course, good sequence abstractions abound in programming languages and libraries, and exploring other approaches to tackling this problem is still an open question for me to explore further. For instance, while I’m somewhat familiar with Clojure’s seq protocol, I do not have a strong intuition for all of the details. It would be interesting to see how Haskell in particular deals with this, as it’s closer to ML, and of course understanding OCaml’s implementation is high on my list. Implementing lazy streams using the reader another thing I have yet to try. Clearly, I’m only just scratching the surface, but in the mean time, I’m ready to move on with my refactoring.

A tale of two parsers, part 2

Jun 30th, 2013

Or, more notes on recursive descent

Previously, I wrote about pitfalls I encountered while writing a recursive descent parser for a little interpreter. I had planned to write about porting that procedural, impure ML parser to Haskell using simple parser combinators, and then with the Parsec library.

However, since writing that post, I discovered a severe bug in the parser. In this post, I’ll describe the bug, how I fixed it and what I learned in the process. Haskell and Parsec will have to wait for now (tl;dr Parsec is pretty neat, but it’s easy to rely on infinite lookahead instead of carefully writing LL(1) parsers).

Parsing function application

I’ve adapted the prototype I described before into the front-end for an ML compiler I’ve been working on, since at its core ML is syntactically similar (semantically similar as well of course) to the lambda calculus. Recall that the parser should enforce the usual rules of precedence and associativity in arithmetic expressions: 1 + 2 * 3 should be parsed as 1 + (2 * 3). This is relatively well-trodden ground, and most compiler texts address this by introducing precedence levels into the grammar.

In a full-fledged programming language like ML, there are other expressions that must have even higher precedence than multiplication, but aren’t usually present in the simple calculator language, for instance function application (or function calls). In ML, function application is simply one expression followed by another: f x instead of f(x) as is the case in most languages with syntax derived from C. This poses a problem for a recursive descent parser, since the resulting grammar has a production that looks something like:

expr -> expr expr (* function application *)
expr -> if expr then expr else expr
expr -> fn id => expr
expr -> let id = expr in expr
...

This was the grammar I wrote in order to add function application to the parser I started with that did arithmetic only, and it grew out of my belief that as a functional, expression oriented language, ML should allow any expression to be called as a function in an application. For instance, the following are all legal ML programs:

val x = (fn x => x) 1
val z = let val x = 1 in fn y => x + y end 2

However, the grammar poses a real problem for the parser writer: the production expr -> expr expr is left-recursive. This production is a pain to fix, and in fact I wasn’t able to figure out how to remove the recursion, since the usual algorithm breaks the production into two pieces but the result in this case still has a left recursion.

The intention with this grammar is that the following snippet of ML should be parsed as a curried function application, allowing the usual lambda calculus style of supporting multiple arguments as in ML and Haskell:

fun add x y z = x + y + z
val sum1 = add 1 2 3
val sum2 = (((add 1) 2) 3)

What I decided was to attempt to fix the problem of parsing function application by ignoring support for curried functions, and add the following production to the grammar:

exprs -> expr expr
expr  -> if exprs then exprs else exprs
expr  -> fn id => exprs
expr  -> let id = exprs in exprs
...

This grammar no longer has the problem that the previous one has, since exprs is not left-recursive. The exprs production can be implemented straightforwardly in the existing recursive descent parser:

and exprs () : 'a A.t =
    let
       (* check if token is in FIRST(expr) *)
       fun FIRSTexpr (L.Id (pos, _))   = SOME pos
         | FIRSTexpr (L.Num (pos, _))  = SOME pos
         | FIRSTexpr (L.Bool (pos, _)) = SOME pos
         | FIRSTexpr (L.If pos)        = SOME pos
         | FIRSTexpr (L.Fn pos)        = SOME pos
         | FIRSTexpr (L.Let pos)       = SOME pos
         | FIRSTexpr (L.LParen pos)    = SOME pos
         | FIRSTexpr _                 = NONE

       val ast1 = expr ()
    in
       if has ()
          then case FIRSTexpr (peek ()) of
                   SOME pos => A.App (pos, ast1, expr ())
                 | NONE => ast1
       else ast1
    end

On line 13, this code parses a single expression, then proceeds to look ahead one token (via has (), which returns true is there are remaining tokens and peek () which returns the next token but does not advance the input). It only parses a second expression if that token is in the set FIRST(expr) (tokens that could possible begin an expr). If the token is not in FIRST(expr), this procedure simply returns the first expression it parsed (the function FIRSTexpr returns an option value rather than true or false because it’s returning the position record which keeps track of line and column number for tokens and AST nodes, for error handling, but it’s not important for this discussion).

This technique works, sort of. It does correctly parse function application expressions and doesn’t support curried functions, which is what we expected. However the new grammar is deeply flawed, and here’s a bit of code that shows how wrong it is:

f x + g y

This code should be parsed as:

(f x) + (g y)

But because I added the new production to the top of the grammar (by which I mean the top of the precedence hiearchy), this caused function application to have the lowest precedence, rather than the highest, which is the usual precedence assigned to application in ML and most languages I’m aware of with this syntax. So with the parser I had, this code is parsed as:

f (x + g)

Pretty bad!

The Definition of Standard ML

I started researching this problem, because it occured to me that it might very difficult or even impossible to parse ML with recursive descent. I didn’t find any evidence that this was the case, and in fact I stumbled upon a paper written by Andrew Appel and David MacQueen about the development of SML/NJ, mentioning that, “Early in the development of the compiler we used a hand-written lexical analyzer and a recursive-descent parser.” This was encouraging, to say the least. I also found a few discussions about ML’s syntax that mentioned features that make it not LL(k), however none of them specifically mentioned function application, instead referencing completely separate issues like distinguishing datatype constructors from variables and dealing with user-defined infix operators.

Eventually I hit upon a few pages in the appendix of The Definition of Standard ML which talk about syntax. I had known about The Definition but had avoided opening it simply because I assumed it would be far too dense and difficult for me to make much use of. I also (mistakenly) assumed it wouldn’t necessarily get into such hairy, pedestrian details of implementation like how to parse source code. As it turns out, not only are some parts of it pretty accessible, but there is a pretty thorough coverage of syntactic issues that doesn’t go as far as defining a grammar suitable for parsing with no modification, but does reveal some interesting properties of ML’s syntax, including how to resolve the problem I was facing with function application.

Appendix B begins with a specification of several classes of phrases that make up Standard ML programs: AtExp, AppExp, InfExp, and Exp. This distinction is subtle but specifically addresses the issue with the grammar I had put together above. Here is the grammar in the appendix, simplified a little bit for clarity:

atexp ::= <constant>
          <variable>
          <constructor> (datatype, exception)
          let <dec> in exp ; ... ; exp end
          ( exp )
appexp ::= <atexp>
           <appexp> <atexp>
infexp ::= <appexp>
           <infexp> <id> <infexp>
exp ::= <infexp>
        if <exp> then <exp> else <exp>
        while <exp> do <exp>
        case <exp> of <match>
        fn <match>
...

There is obviously more to the full grammar, but this snippet illustrates something important, namely that the following program is syntactically illegal in Standard ML:

val illegal1 = Int.toString if true then 1 else 2

Essentially, there is a relatively restricted class of expressions that are allowed to be in the function or argument position of an application without being parenthesized: simple expressions like variables, constants, records, selectors (e.g. #foo), tuples, lists, and let expressions. All other expressions must be parenthesized to be either applied as functions or passed as arguments to functions.

In hindsight, this makes a lot of sense, since there are some “open-ended” expressions that can’t possibly occur in the function position: if, while, case and fn all lack terminating end keywords unlike let, so if they were allowed to be applied, any expression occuring after them would be ambiguous, and could be part of the body or considered an argument. It’s not as clear to me why the program above is illegal, but requiring both the function an argument to be “atomic” expressions might make for a cleaner grammar.

Adding this separation to my grammar and to my parser wasn’t difficult, and now function applications would be parsed correctly with the highest precedence when mixed with infix operators. Here’s the code for the expr production:

fun expr () : 'a A.t =
    (log "expr";
     case peek () of
         L.If pos =>
         (adv ()
         ; let val e1 = expr ()
           in case peek () of
                  L.Then _ => (adv ()
                              ; let val e2 = expr ()
                                in case peek () of
                                       L.Else _ => (adv ()
                                                   ; A.If (pos, e1, e2, expr ()))
                                     | t => expected "else" t
                                end)
                | t => expected "then" t
           end)
       | L.Fn pos =>
         (adv ()
         ; case peek () of
               L.Id (pos', x) => (adv ()
                              ; case peek () of
                                    L.Arrow _ => (adv ()
                                                 (* FIXME: two ids for Fn *)
                                                 ; A.Fn (pos', pos, x, expr ()))
                                  | t => expected "=>" t)
             | t => err ("expected formal arg in fn expr, got " ^ L.show t))

       | L.Match pos =>
         (adv ()
         ; let val e1 = expr ()
           in case peek () of
                  L.With _ => (adv ()
                              ; A.Match (pos, e1, clauses ()))
                | t => expected "with" t
           end)

       | _ => infexp ())

It peeks one token ahead (case peek () of on line 3) and then depending on the token found, either parses an if, fn, or match expression (match in my compiler is the same as case in SML, and match in OCaml, the reason for using match instead of case is tied up in the work I’ve been doing to compile pattern matching, which deserves a whole separate post), or defers to parse an infix expression with infexp.

and appexp () : 'a A.t =
    (log "appexp";
     appexp' (atexp ()))

(*
 * lhs is the left hand side of the (potential) application
 *)
and appexp' (lhs : 'a A.t) : 'a A.t =
    (log "appexp'";
     if has () andalso FIRSTatexp (peek ())
        then appexp' (A.App (A.getInfo lhs, lhs, atexp ()))
     else lhs)

and atexp () : 'a A.t =
    (log "atexp";
     case peek () of
         L.Let pos =>
         (adv ()
         ; case peek () of
               L.Val _ =>
               (adv ()
               ; case peek () of
                     L.Id (_, x) =>
                     (adv ()
                     ; case peek () of
                           L.Eqls _ =>
                           (adv ()
                           ; let val bound = expr ()
                             in case peek () of
                                    L.In _ =>
                                    (adv ();
                                     let val body = expr ()
                                     in case peek () of
                                            L.End _ => (adv (); A.Let (pos, x, bound, body))
                                          | t => expected "end" t
                                     end)
                                  | t => expected "in" t
                             end)
                         | t => expected "=" t)
                   | t => err ("expected bound var in let expr, got " ^ L.show t))
             | t => expected "val" t)
       | L.Num (pos, n) => (adv (); A.Num (pos, n))
       | L.Bool (pos, b) => (adv (); A.Bool (pos, b))
       | L.Id (pos, s) => (adv (); A.Id (pos, s))
       | L.LParen _ => (adv (); let val ast = expr ()
                                in case peek () of
                                       L.RParen _ => (adv (); ast)
                                     | t => expected ")" t
                                end)
       | t => expected "let, id or constant" t)

The code to parse let expressions is still just as hairy as before (and definitely could or should be improved!) but the important part here is the distinction between appexp, atexp, infexp, and expr, which represent the same classes of phrases described in The Definition. The result is that this code not only correctly parses function application with higher precedence than all the infix operators, but also correctly parses curried function application (as left-associative).

The lessons I learned here were simple: (a) do some research and (b) sometimes there isn’t a fully general, elegant solution (in this case, the production expr -> expr expr) and sometimes introducing a level of complexity (the different classes of expression) is necessary.

The complete parser code is here. I hope to follow this post up with another one detailing changes I made to the infix expression parser, and how the technique I ended up using turned out to be a perfect fit for ML in another part of syntax that is hairy to parse with recursive descent.

A tale of two parsers, part 1

Jun 22nd, 2013

Or, notes on recursive descent

I worked on a little lambda calculus interpreter with fellow Hacker Schooler last week. We wrote it in Haskell since he was starting out learning Haskell and I’ve been meaning to come to grips with Haskell and develop some general proficiency in it (instead of just faking my way through pretending it’s ML). I’ve written little interpreters before, so to ensure I was working at the edge of my ability and knowledge, I decided to add a front-end (lexer and parser) to our interpreter. Previously, I’ve always relied on Scheme’s (or Lisp’s) reader to do parsing for me. Even though the language is tiny, this would be the largest parsing job I’ve ever tackled (actually not entirely true, my first task at an internship many years ago was to write an XML parser in Java, however I suspect that my manager at the time was trolling or hazing me, and anyway I didn’t really have any idea what I was doing. This time at least I’d read the first few chapters of the Dragon book).

Back to ML

I prototyped the frontend in Standard ML because all of the parsing code and pseudo-code I’ve read has been relatively imperative, and I knew I’d be able to reach for mutable references in Standard ML if I needed them (this is one reason I generally prefer ML to Haskell, I think using impure code to patch over incomplete design is rather useful, something I learned to put into words after reading Matt Might’s description). That turned out to be pretty helpful, and the Haskell version took a drastically different approach in the end, so it was probably good I didn’t try to tackle that first.

Something that was pretty important was that the parser should return an actual abstract syntax tree, allowing the interpreter to do the real work of evaluating the program. It would do no good to simply check that the input conformed to the grammar of the language, which is what many or most parsers in compiler text books do. Accumulating and returning an abstract syntax tree was something I could envision being able to do in the course of implementing a parser, but it wasn’t obvious to me how that worked, especially in the framework of a purely or mostly functional language, where destructively updating fields in an AST node is mostly out of the question (Haskell) or at least strongly inadvisable (SML).

The parser I wrote in SML works by recursive descent, and since our language supports infix arithmetic expressions, I had to deal with some classic problems like correctly implementing precedence of multiplication over addition, as well as left associativity. Together, getting these right was the hardest part of writing the parser.

Arithmetic

Here’s the classic grammar for arithmetic expressions, transformed to remove left-recursion and to enforce a higher precedence for multiplication over addition.

expr   -> term expr'
expr'  -> + term expr'
expr'  ->
term   -> factor term'
term'  -> * factor term'
term'  ->
factor -> ( expr )
factor -> id
factor -> num

One mistake I did when starting out on the parser was to immediately refactor this grammar to remove what I thought was a redundancy:

expr   -> term expr'
expr'  -> + expr        // production changed
expr'  ->
term   -> factor term'
term'  -> * term        // production changed
term'  ->
factor -> ( expr )
factor -> id
factor -> num

I changed two productions that seemingly had two non-terminals when they could have one. This introduced a subtle bug in the parser. Here’s the code for the expr production, using the second (refactored) grammar above:

expr

fun expr () : ast =
    (log "expr";
     let val lhs = term ()
     in case expr' () of
            NONE => lhs
          | SOME (oper, rhs) => oper (lhs, rhs)
     end)

and term () : ast =
    (log "term";
     let val lhs = factor ()
     in case term'() of
            NONE => lhs
          | SOME (oper, rhs) => oper (lhs, rhs)
     end)

and expr' () : ((ast * ast -> ast) * ast) option =
    (log "expr'";
    if has ()
       then case peek () of
                L.Add => (next (); SOME (Add, expr ())) (* recur by calling expr *)
              | L.Sub => (next (); SOME (Sub, expr ()))
              | _ => NONE
    else NONE)
(* ... *)

Here, the expr' procedure looks one token ahead (case peek () of... on line 20) and returns (on line 21 and 22) a pair of AST node constructor and AST node for the non-epsilon production (and returns NONE in the case of the epsilon production). The second element of this pair is the right-hand side of the operator represented by the constructor. While this is still a left-most derivation since expr is the only non-terminal in this production, the result is a right-associative AST. For instance, for the expression 1 + 2 - 3, this code would produce the AST Add(Num 1, Sub (2, 3)) (where Num is a unary constructor of leaf nodes for numbers). This makes addition and subtraction left associative operations, which is incorrect from the usual conventions in elementary school math as well as programming languages. In a sense, neither the code or the grammar is wrong, the grammar (I believe…) is sort of correct in the theoretical sense that it’s possible to derive a parse tree that’s right associative, and the code tries to implement the grammar as it’s written, but together these things conspired to produce a bug. Fixing it was as simple as removing my mistaken simplification, returning the grammar to the first one above, and then updating the code to match:

expr

fun expr () : ast =
    (log "expr";
     let
        val lhs = term ()
     in
        expr' lhs
     end)

and term () : ast =
    (log "term";
     let
        val lhs = factor ()
     in
        term' lhs
     end)

and expr' (lhs : ast) : ast =
    (log "expr'";
    if has ()
       then case peek () of
                L.Add => (next (); expr' (Add (lhs, term ())))
              | L.Sub => (next (); expr' (Sub (lhs, term ())))
              | _ => lhs
    else lhs)

This had the extra effect of forcing another refactoring of the code, resulting in the expr' procedure taking an AST node as an argument instead of unit, and returning an AST node. It recursively calls itself after calling term and applying the same AST node constructor it returned in the previous version. Parsing the expression 1 + 2 - 3 would now correctly produce Sub(Add(Num 1, Num 2), 3).

Other expressions

Adding other expressions to the parser was easy to do, but very tedious. There’s probably a nicer way to implement these simpler productions (where there the first element is a single non-terminal, and it’s just a question of matching the input to the rest of the non-terminal), but while this code it’s pretty, it is quite obvious and follows directly from the grammar. Here’s the code for the expr production, after adding If, Let and Fn (i.e. Lambda) expressions:

expr

fun expr () : ast =
    (log "expr";
     case peek () of
         L.If =>
         (adv ()
         ; let val e1 = exprs ()
           in case peek () of
                  L.Then => (adv ()
                            ; let val e2 = exprs ()
                              in case peek () of
                                     L.Else => (adv ()
                                               ; If (e1, e2, exprs ()))
                                   | t => expected "else" t
                              end)
                | t => expected "then" t
           end)
       | L.Fn =>
         (adv ()
         ; case peek () of
               L.Id x => (adv ()
                         ; case peek () of
                               L.Arrow => (adv (); Fn (x, exprs ()))
                             | t => expected "=>" t)
             | t => err ("expected formal arg in fn expr, got " ^ L.show t))
       | L.Let =>
         (adv ()
         ; case peek () of
               L.Id x => (adv ()
                         ; case peek () of
                               L.Eq => (adv ()
                                       ; let val bound = exprs ()
                                         in case peek () of
                                                L.In => (adv (); Let (x, bound, exprs ()))
                                              | t => expected "in" t
                                         end)
                             | t => expected "=" t)
             | t => err ("expected bound var in let expr, got " ^ L.show t))
       | _ => expr' (term ()))

Hairy and imperative, but it gets the job done.

Unfunctional

This code is a great example of how flexible ML can be, in the sense that it really allows you to write rather boring and bad procedural code. The entire parse function relies on a local mutable reference pointing to the rest of the stream of input tokens, and a small collection of local functions that handle this reference:

parse

fun parse toks =
    let
       val rest = ref toks
       fun has () = not (null (!rest))
       fun adv () = rest := tl (!rest)
       fun next () = hd (!rest) before adv ()
       fun getNext () = if has () then SOME (next ()) else NONE
       fun peek () = hd (!rest)
       fun match tok = has () andalso tok = peek ()
       fun err s = raise SyntaxError ("err " ^ s)
       fun expected s t = raise SyntaxError ("expected " ^ s ^ ", got " ^ L.show t)
(* ... *)

Refactoring this code to remove the mutable reference and instead pass around the rest of the input is something I still want to do, however since our original interpreter was written in Haskell, I decided to move on to porting this to Haskell, where I’d be forced to figure out how to make it functional. I’ll go over that parser in the next post.

The complete code for this parser is here.

Literate blogging with Org mode and Octopress

Apr 14th, 2013

My blog used to be hosted on Posterous, but as of this post it’s generated by a little bit of Emacs Lisp that turns Org mode files into Markdown files which are then processed by Octopress to produce the site you see here. This has the usual benefits of static site generators (simple to deploy, easy to version in Git, fast, etc) but I hope that using Org mode will also motivate me to experiment with literate programming via Babel.

An Octopress-flavored Markdown backend

There are already a few other ways to publish Org files as Markdown or Jekyll (or Octopress) blog posts, however desipite my short list of requirements, I found them to be a bit lacking. In proper not-invented-here-style, I set out to write some Emacs Lisp to export Org files to something Octopress could work with. I learned a bit about some new features of Org’s internals along the way, as well as about Emacs, since this is the largest project I’ve undertaken using Emacs Lisp.

Org has a new exporting system that’s not released with the latest stable build yet, but is well documented and is more flexible than the existing generic export library. Writing a simple backend that covered a small subset of Org’s features turned out to be pretty easy to do. If you have some need to process Org files that’s not supported out of the box, I’d recommend trying this approach. What follows are notes I took while working on this, however since this version of Org has not been released yet, you should of course refer to Org’s documentation for any inconsistencies and so on you may find. Extra caveats apply here since I’m a novice both with Org and with Emacs Lisp in general.

Since this new export system is not really released yet, you have to clone the latest from Git. The new export functionality is defined in ox.el. Of course, there are some backends packaged with Org already, including one that exports to “ASCII” text which was extremely helpful to use as an example.

Defining a new backend is done with the org-export-define-backend function, which accepts two arguments, a symbol and an alist. The symbol is just the name of your new backend. The alist is what actually defines your backend. The “keys” should be the type of Org syntax element, and the “values” should be a function that can export that type of element to your chosen format.

org-export-define-backend

(org-export-define-backend 'octopress
  '(
    (bold . org-octopress-bold)
    (fixed-width . org-octopress-fixed-width)
    (headline . org-octopress-headline)
    (italic . org-octopress-italic)
    (link . org-octopress-link)
    (paragraph . org-octopress-paragraph)
    (section . org-octopress-section)
    (src-block . org-octopress-src-block)
    (template . org-octopress-template)
))

Here I’ve defined a small backend that supports the bare minimum I needed for this blog. There are many more types of Org syntax that I’m not supporting, but since I’m a novice Org user, I figure when I discover a need for those, I’ll add them to the backend.

how Org parses and exports files

When Org exports a file using this backend, and it comes across an element of the type “bold” for instance, it will call the function associated with 'bold, in this case org-octopress-bold. That function should take three arguments: the node of the abstract syntax tree for that element, the “contents” of that node, and a plist with extra information (called the “communication channel” in the Org docs). It should return the result of exporting that node as a string.

When you publish an Org file, it’s first parsed into an abstract syntax tree, and then the export system calls these functions, in a bottom-up fashion starting with the leaves of the tree. Each function returns a string, and these strings are accumulated together until eventually the function for the root of the tree is called, and the entire document has been converted. This is probably best demonstrated with an example. Here’s a snippet of an Org file:

* Foo

Bar, baz! Qux flarp zoot, zip.

** Bar

Frobnicate blop, Fribble *tele* gorx.

Org will parse this file into the following syntax tree (actually the real tree has much more data attached to each node, and is also recursive making it difficult to print):

org-export-define-backend

(org-data
 (headline "Foo"
           (section
            (paragraph "Bar, baz! ..."))
           (headline "Bar"
                     (section
                      (paragraph "Frobnicate..."
                                 (bold "tele")
                                 "gorx.")))))

Org will walk over this tree and call your backend functions. In this case, the first function it will call is the one associated with the type bold since the node (bold "tele") is the first leaf. In my backend, this is the function org-octopress-bold.

org-octopress-bold

(defun org-octopress-bold (text contents info)
  "Transcode bold text to Octopress equiv of <strong>"
  (format "**%s**" contents))

This is a simple conversion from Org mode to Markdown, and since my backend is very bare-bones, I ignore the other arguments and just wrap the contents string, which in this case will be "tele", in asterisks. (Markdown doesn’t really have a concept of “bold” but instead uses HTML “strong” and “emphasis” tags which most browser’s default CSS renders as bold and italic, respectively)

paragraphs

The next function that will be called is org-octopress-paragraph, since it’s the next least node in the syntax tree.

org-octopress-paragraph

(defun org-octopress-paragraph (paragraph contents info)
  contents)

The contents of the paragraph will be a string, containing the results of transcoding the children of the paragraph, in this case two plain strings and a bold string. While I think there are some subtleties around newlines, the simplest way to deal with paragraphs are to just return the contents unchanged. Of course, if we were writing a new HTML backend, we would wrap the contents in p tags.

To be honest, paragraphs are actuall part of this system I’m a little shaky on. From what I could determine by reading ox.el and doing some experiments, there are a few syntax elements that you must provide transcoder functions for. Paragraphs are one of them. The element types headline, section, and the special type “template” are others that must be provided. The reason for this is that these are intermediary nodes in the syntax tree, so if they are not provided at all, the results of other nodes will never be accumulated.

headlines

Continuing our example parse, a node for section will be transcoded, and in my case I’m using a similar function as for paragraphs, which just returns the contents unchanged. The next node after that will be the headline node for the headline “Bar” in the original Org source.

org-octopress-headline

(defun org-octopress-headline (headline contents info)
  (let ((value (org-element-property :raw-value headline))
        (level (org-element-property :level headline)))
    (concat (apply 'concat (repeat "#" level))
            " "
            value
            "\n"
            contents)))

Markdown has a similar syntax for headlines as Org, but uses pound or hash symbols instead of asterisks. Here, we use the function org-element-property to extract some properties from the AST node headline. We need the raw value which is the headline without asterisks, and the level which is the number of asterisks. In this case, I convert all levels of headlines to Markdown headlines, but if you were to be writing a “real” backend I think you would want to respect the :headline-levels option for the project, and only convert headlines of a certain level. Again, like the paragraph node, the contents are the children of the headline, which includes everything under that headline, so we must concatenate the Markdown-style headline string with the contents so as not to lose the rest of the document.

document template

The export process continues in this fashion, until all the nodes are transcoded, their strings accumulated. There’s a special AST node type called template which represents the root of the Org document. The docs suggest using this to add a preamble and/or postamble to the result. In my case, I wanted to output the YAML front matter used by Jekyll to generate blog posts. The template transcoder function takes only two arguments, the contents string and the info plist. The root AST node is not passed into this function, I assume because the idea is that you’ve already transcoded its children and there’s not really any concrete Org syntax associated with it, so there’s nothing to do with the root node but return the contents, wrapped in pre- or postamble.

org-octopress-template

(defun org-octopress-template (contents info)
  "Accepts the final transcoded string and a plist of export options,
returns final string with YAML frontmatter as preamble"
  (let ((title (car (plist-get info :title)))
        (date (car (plist-get info :date)))
        (time "")
        (frontmatter
"---
layout: post
title: %s
date: %s %s
comments: true
external-url:
categories:
---
"))
    (if *org-octopress-yaml-front-matter*
        (concat (format frontmatter title date time) contents)
      contents)))

This function is an example of using the “communication channel” which is the third argument of the other transcoder functions but in this case is the second. The info plist contains all the metadata about the document that’s defined in the Org “export options template”. It’s from this that we extract the title of the post and the date and add it to the YAML front matter.

code blocks

Source blocks are another area where I customized things to output something specific to Octopress. The Github-style “backtick” code blocks used by Octopress take optional language and name parameters, which are used for syntax highlighting and for adding captions to the source block itself. Similarly, Org supports passing the language to source blocks, and attaching a name to elements in general, so I used that to add this to the backtick code block if present. I also added a little hack to ignore the language if it was something not supported by Pygments.

org-octopress-src-block

(defun org-octopress-src-block (src-block contents info)
  "Transcode a #+begin_src block from Org to Github style backtick code blocks"
  (let* ((lang (get-lang (org-element-property :language src-block)))
         (value (org-element-property :value src-block))
         (name (org-element-property :name src-block))
         (lang-and-name (or (and lang name (format " %s %s\n" lang name)) "\n")))
    (concat
     "```"
     lang-and-name
     value
     "```\n"
     contents)))

Tying it together

Having defined the new backend, all that’s left is a little boilerplate to make this backend available to Org projects:

org-octopress-publish-to-octopress

(defun org-octopress-publish-to-octopress (plist filename pub-dir)
  (org-publish-org-to 'octopress filename ".md" plist pub-dir))

My blog’s Org project alist is then, at the bare minimum:

'(("posts"
   :base-directory "/path/to/blog/root"
   :base-extension "org"
   :publishing-directory "/path/to/octopress/root/source/_posts"
   :publishing-function org-octopress-publish-to-octopress)
  ("blog name" :components ("posts")))

To start a new blog post, I use this little helper to create a new Org file with the right naming convention and export template:

octopress-helpers

(defun new-post (dir title)
  "Create and visit a new .org file in dir named $date-$title.org, ie
Octopress/Jekyll style"
  (interactive "Mdirectory: \nMtitle: ")
  (let* ((date (format-time-string "%Y-%m-%d"))
         (title-no-spaces (replace-regexp-in-string " +" "-" title))
         (dirname (file-name-as-directory dir))
         (filename (format (concat dirname "%s-%s.org") date title-no-spaces)))
    (find-file filename)
    (rename-buffer title)
    (org-insert-export-options-template)
    (rename-buffer filename)))

While I’m working on a post, I can start the Octopress preview server the normal way (rake preview) and then export the current Org file with org-publish-current-file to preview the final output in a browser.

Literate example

In case it wasn’t obvious, this whole post was written using this system, and in fact the entire (albeit small) body of code is contained in the Org file for this post, using Babel’s noweb-style literate facilities. The source for this post is on Github, and contains some extra code not exported, like various helper functions and tests.

Working on this blog post while adding some features and fixing bugs to the code was pretty interesting, even as a small taste of literate programming. I’ve got some ideas for literate posts I’d like to write, so I’m looking forward to experimenting with this and will probably add to the backend as I find new Org features.

Please let me know if you find any errors in the code, it’s definitely just an alpha MVP at this point, but if anyone’s interested in using it I would be pleased to hear from you, and help if you run into problems.

You have to do basic science on your libraries to see how they work.

Jul 19th, 2012

…so said Gerald Sussman, when a Lisper asked why MIT switched from Scheme to Python, and why SICP was replaced with a robotics class. (the full quote or paraphrasing is really brilliant and I think sums up a large part of the experience of my generation of programmers, and not just the ones who went to MIT).

That’s old news, and this post isn’t commenting on that. It’s just that this has been ringing truer and truer for me recently, as I struggle to get some useful work done using Android’s WebView. I think the struggle is paying off, since with the help of the source I’ve been able to cut to some truth in places.

I’ve also been struggling to get unit tests that exercise code interacting with a WebView, and with JavaScript running in a WebView, to work at all. When the code for a unit test is more complicated and requires more careful thought, than the code under test, I think there may be something sour going on.

When working with this framework (WebView on Android, and Android in general) more often then not you end up with this wretched slurry of (relatively) weakly, statically typed Java, untyped XML configuration files, and weakly, dynamically typed JavaScript that’s deeply removed from the REPL. (Yep, I’m throwing brickbats with weasel words here) It makes one yearn for the kind of strict and exacting type safety of Haskell. It’s back to Haskell for me, although not for Android apps, unfortunately. Maybe Scala could fill that role, although I suspect it would be just halfway on a log scale.

I really don’t understand types well enough to whinge in this fashion, so I’ve begun some levelling up there. This thread on Hacker News gave me a lot to ponder. I hope TAPL has the answers.

← Older Blog Archives

Or, check out this one weird trick to make Hindley-Milner blow up! Haskell and ML compilers hate it!

Talks are HARD! Ten minutes talks are REALLY HARD!!

Haskell and ML

Hindley-Milner

let-polymorphism

pathological case

pathological case, take 2

computer SCIENCE

“How to compile Turing machines to ML types”

questions

References and further reading

Links

Papers and other published work

Learning ML

computational complexity theory

Or, handling syntax errors without exceptions

exceptions in ML

exceptions in parsing

specializing either

an error-reporting parser for s-expressions

pretty-printing ASTs

… in ML, of course

positional lexical analyzers

what’s the point of all this

back to the lexer

defining a token type

the actual lexer

positions

positional streams

tricky bits

crash course on s-expressions

reader

tokens

whitespace

tokenize

Functional I/O

Naive streams

Reader

What’s next

Or, more notes on recursive descent

Parsing function application

The Definition of Standard ML

Or, notes on recursive descent

Back to ML

Arithmetic

Other expressions

Unfunctional

An Octopress-flavored Markdown backend

how Org parses and exports files

paragraphs

headlines

document template

code blocks

Tying it together

Literate example

specializing `either`