Or, more notes on recursive descent
Previously, I wrote about pitfalls I encountered while writing a recursive descent parser for a little interpreter. I had planned to write about porting that procedural, impure ML parser to Haskell using simple parser combinators, and then with the Parsec library.
However, since writing that post, I discovered a severe bug in the parser. In this post, I’ll describe the bug, how I fixed it and what I learned in the process. Haskell and Parsec will have to wait for now (tl;dr Parsec is pretty neat, but it’s easy to rely on infinite lookahead instead of carefully writing LL(1) parsers).
Parsing function application
I’ve adapted the prototype I described before into the front-end for an ML compiler I’ve been working on, since at its core ML is syntactically similar (semantically similar as well of course) to the lambda calculus. Recall that the parser should enforce the usual rules of precedence and associativity in arithmetic expressions:
1 + 2 * 3 should be parsed as
1 + (2 * 3). This is relatively well-trodden ground, and most compiler texts address this by introducing precedence levels into the grammar.
In a full-fledged programming language like ML, there are other expressions that must have even higher precedence than multiplication, but aren’t usually present in the simple calculator language, for instance function application (or function calls). In ML, function application is simply one expression followed by another:
f x instead of
f(x) as is the case in most languages with syntax derived from C. This poses a problem for a recursive descent parser, since the resulting grammar has a production that looks something like:
1 2 3 4 5
This was the grammar I wrote in order to add function application to the parser I started with that did arithmetic only, and it grew out of my belief that as a functional, expression oriented language, ML should allow any expression to be called as a function in an application. For instance, the following are all legal ML programs:
However, the grammar poses a real problem for the parser writer: the production
expr -> expr expr is left-recursive. This production is a pain to fix, and in fact I wasn’t able to figure out how to remove the recursion, since the usual algorithm breaks the production into two pieces but the result in this case still has a left recursion.
The intention with this grammar is that the following snippet of ML should be parsed as a curried function application, allowing the usual lambda calculus style of supporting multiple arguments as in ML and Haskell:
1 2 3
What I decided was to attempt to fix the problem of parsing function application by ignoring support for curried functions, and add the following production to the grammar:
1 2 3 4 5
This grammar no longer has the problem that the previous one has, since
exprs is not left-recursive. The
exprs production can be implemented straightforwardly in the existing recursive descent parser:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
On line 13, this code parses a single expression, then proceeds to look ahead one token (via
has (), which returns true is there are remaining tokens and
peek () which returns the next token but does not advance the input). It only parses a second expression if that token is in the set
FIRST(expr) (tokens that could possible begin an
expr). If the token is not in
FIRST(expr), this procedure simply returns the first expression it parsed (the function
FIRSTexpr returns an
option value rather than true or false because it’s returning the position record which keeps track of line and column number for tokens and AST nodes, for error handling, but it’s not important for this discussion).
This technique works, sort of. It does correctly parse function application expressions and doesn’t support curried functions, which is what we expected. However the new grammar is deeply flawed, and here’s a bit of code that shows how wrong it is:
This code should be parsed as:
But because I added the new production to the top of the grammar (by which I mean the top of the precedence hiearchy), this caused function application to have the lowest precedence, rather than the highest, which is the usual precedence assigned to application in ML and most languages I’m aware of with this syntax. So with the parser I had, this code is parsed as:
The Definition of Standard ML
I started researching this problem, because it occured to me that it might very difficult or even impossible to parse ML with recursive descent. I didn’t find any evidence that this was the case, and in fact I stumbled upon a paper written by Andrew Appel and David MacQueen about the development of SML/NJ, mentioning that, “Early in the development of the compiler we used a hand-written lexical analyzer and a recursive-descent parser.” This was encouraging, to say the least. I also found a few discussions about ML’s syntax that mentioned features that make it not LL(k), however none of them specifically mentioned function application, instead referencing completely separate issues like distinguishing datatype constructors from variables and dealing with user-defined infix operators.
Eventually I hit upon a few pages in the appendix of The Definition of Standard ML which talk about syntax. I had known about The Definition but had avoided opening it simply because I assumed it would be far too dense and difficult for me to make much use of. I also (mistakenly) assumed it wouldn’t necessarily get into such hairy, pedestrian details of implementation like how to parse source code. As it turns out, not only are some parts of it pretty accessible, but there is a pretty thorough coverage of syntactic issues that doesn’t go as far as defining a grammar suitable for parsing with no modification, but does reveal some interesting properties of ML’s syntax, including how to resolve the problem I was facing with function application.
Appendix B begins with a specification of several classes of phrases that make up Standard ML programs:
Exp. This distinction is subtle but specifically addresses the issue with the grammar I had put together above. Here is the grammar in the appendix, simplified a little bit for clarity:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
There is obviously more to the full grammar, but this snippet illustrates something important, namely that the following program is syntactically illegal in Standard ML:
Essentially, there is a relatively restricted class of expressions that are allowed to be in the function or argument position of an application without being parenthesized: simple expressions like variables, constants, records, selectors (e.g.
#foo), tuples, lists, and
let expressions. All other expressions must be parenthesized to be either applied as functions or passed as arguments to functions.
In hindsight, this makes a lot of sense, since there are some “open-ended” expressions that can’t possibly occur in the function position:
fn all lack terminating
end keywords unlike
let, so if they were allowed to be applied, any expression occuring after them would be ambiguous, and could be part of the body or considered an argument. It’s not as clear to me why the program above is illegal, but requiring both the function an argument to be “atomic” expressions might make for a cleaner grammar.
Adding this separation to my grammar and to my parser wasn’t difficult, and now function applications would be parsed correctly with the highest precedence when mixed with infix operators. Here’s the code for the
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
It peeks one token ahead (
case peek () of on line 3) and then depending on the token found, either parses an
match expression (
match in my compiler is the same as
case in SML, and
match in OCaml, the reason for using
match instead of
case is tied up in the work I’ve been doing to compile pattern matching, which deserves a whole separate post), or defers to parse an infix expression with
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
The code to parse
let expressions is still just as hairy as before (and definitely could or should be improved!) but the important part here is the distinction between
expr, which represent the same classes of phrases described in The Definition. The result is that this code not only correctly parses function application with higher precedence than all the infix operators, but also correctly parses curried function application (as left-associative).
The lessons I learned here were simple: (a) do some research and (b) sometimes there isn’t a fully general, elegant solution (in this case, the production
expr -> expr expr) and sometimes introducing a level of complexity (the different classes of expression) is necessary.
The complete parser code is here. I hope to follow this post up with another one detailing changes I made to the infix expression parser, and how the technique I ended up using turned out to be a perfect fit for ML in another part of syntax that is hairy to parse with recursive descent.