Or, handling syntax errors without exceptions
I didn’t always find parsing interesting. That might be a surprise judging from the past half dozen blog posts I’ve written, but I got interested in compilers and programming langauge implementation via various Lisp and Scheme books, like SICP and The Little Schemer, so I quickly adopted the “parsing is boring!” mantra that is espoused by some authors.
However, parsing is part of every compiler (even Lisp compilers), and syntax is part of every programming language, so ignoring it forever would never do. I’m enchanted with how programming languages work in part because it feels like peeling back the curtain and finding out that there is no magic, just code. I couldn’t treat parsers as a black box for very long.
Parsing is more than taking a stream of characters and producing an abstract syntax tree; that’s just the bare minimum. Parsers must be usable, by humans who are feeding them source code, and by backends of compilers and interpreters. This means proper error handling, and treating errors as first class citizens. This makes it a much more interesting problem.
exceptions in ML
Standard ML (and OCaml, and the “platonic ML”) has an exception mechanism which should be familiar to many programmers who have used almost any other language with exceptions. In some ways, exceptions are fundamental to ML, since there are certain obvious cases (taking the head of an empty list, dividing by zero, non-exhaustive matches) where the runtime must raise an error of some kind and providing for handling them makes sense. Exceptions are interesting in a few different theoretical ways involving types, but they’re perfectly usable by the working programmer too.
Over use (or frankly even any use) of exceptions is somewhat bad practice in my humble opinion. For the same reason ML programmers would rather opt for the classic
'a option type over pervasive nullable pointers, it seems to me that embedding all possible values in the type of an expression and avoiding exceptions altogether would lead to more robust code. There are other uses of exceptions beyond error handling, for non-local control flow, but I doubt anyone would seriously advocate it’s good style. Like mutable state, I think using exceptions in ML programs is more of a patch over an incomplete design than a good practice. I’m sure there are exceptions to this rule, but I don’t think parsing is one of them.
exceptions in parsing
Let’s talk about a concrete example, the parser I plan to write for s-expressions. Without writing the implementation, here’s the type of a parser “factory”:
Like the lexical analyzer factory, this function takes a token reader and returns a parser, which is a new reader that produces abstract syntax trees (
sexprs) from the stream. Recall that readers return a value of type
('a, 'b) option, in other words either a
SOME (element, stream) or
NONE. In the case of I/O or general stream processing code, the
NONE value is simple: it just signifies that the stream is empty or the file handle has reached the EOF. (If this paragraph was incomprehensible to you, start here.)
As for the parser, what should happen in the case of a syntax error? Success is easy enough to understand, the result will be
SOME (ast, stream) but failure is more complicated than just
NONE. Reaching the end of the stream isn’t an exceptional circumstance, and shouldn’t represent an error (reaching the end EARLY is an error, but that just underlines my point, that
NONE clobbers any granularity we might want). Ideally the parser would report something more interesting than just the existence of an error, but also an error message and where it occurred. This is necessary to achieve my goal of:
1 2 3
There is an alternative to
'a option which isn’t part of the Standard ML basis but is easy to define and should be familiar to Haskell programmers:
('a, 'b) either.
This is basically a generalized union for types. Given two types
either constructs a type consisting of their union or sum. It’s useful when you want to return an optional value but want the
NONE variant to carry some extra data. You can imagine instead of
In our case, this is close but not quite enough. There are actually two possibilities other than success when trying to parse an ast from a stream: errors but also simply reaching the end of the stream. Reaching the end of a stream isn’t actually a error condition: it just means I’ve finished parsing a file or some other I/O stream (like a REPL being terminated with `). So it turns out that the following ternary type is better suited to parsing:
I’ve further specified the type to return a pair of resulting element and rest of stream in the case of success, and return a pair of error message and stream in the case of errors. Here are a few example values of this type for clarification:
And here’s a little example function that consumes values of this type:
1 2 3 4 5 6
an error-reporting parser for s-expressions
Our parser will take a stream and return an ML datatype representing s-expressions. This type is a simplified version of s-expressions that does not include “dotted pairs” because dotted pairs complicates the parser quite a bit without making it much more interesting.
sexpr type is polymorphic leaving room for any data to be attached to each and every node in the abstract syntax tree. In our case, this will simply be line and column number (
Pos.t) but you can imagine in a real compiler you might have more interesting data to attach, like scope or type information.
Finally, here is the complete source for the parser factory function, which takes a lexer and returns a parser:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
This function follows from the s-expression grammar:
1 2 3 4
The most interesting bit of this is probably the recursive call to
sexpr on line 15. Here, I recursively parse an s-expression appearing inside a list and unpack the result only if it was successfully parsed, otherwise I return the result which is likely an error; I don’t believe that this call could ever produce the EOF value but I haven’t actually proved it or tested the code heavily to be sure.
1 2 3 4 5 6 7 8 9 10
Unfortunately, s-expressions are so simple there are only really two interesting errors: unexpected end of streams, and unexpected closing parentheses.
1 2 3 4 5 6
However, since every node in the abstract syntax tree is annotated with a position in concrete syntax, an interpreter or compiler that consumed the tree could use that position for better runtime error reporting. Hopefully you can also imagine a type checker doing the same for type errors.
Printing a nice error message with a little caret to the offending character requires a little extra work. Since the stream at the point of the error is actually the rest of the stream, and doesn’t include any way to backtrack. For a majority of cases, printing the line number, then the line, and then the error message and column number would probably suffice. Getting back to the start of the offending line is the problem. It’s enough to just keep the stream at the beginning of the current line around until the reader reaches a new line:
1 2 3 4 5 6
This is a function that takes a character reader (
rdr) and returns a new reader that operates on a “richer” positional stream, one that also includes the stream state at the start of the current line. The anonymous function return take a triple of
s is the stream at the start of the line,
t is the current stream in the middle of the line, and
p is the position. Printing the error message is then a question of taking the stream at the start of the line, consuming the entire line from that stream, and formatting the output nicely, i.e. left as an exercise for the reader.
It’s worth noting that some programming language implementations take a pretty drastically different approach to this entire problem: pretty printing abstract syntax trees. Particularly relevent is SML/NJ, which uses this when printing type errors:
1 2 3 4 5
This is a simple example (note that the case expression’s branches are of different type) that yields the following error:
1 2 3 4 5 6 7
The concrete syntax appearing after “in expression:” has been reconstructed and pretty-printed (I think) from the abstract syntax tree, and in any case differs from the original in whitespace and parenthesization. This occasionally causes me some confusion when debugging really hairy type errors, with more complicated expressions. Interestingly, the position (“3.9-5.31” which the start and end of the expression, in the form “line.column-line.column”) DOES match up with the original input. So SML/NJ must keep that info around but doesn’t use it to go back to the original input.