Or, more notes on recursive descent
Previously, I wrote about pitfalls I encountered while writing a recursive descent parser for a little interpreter. I had planned to write about porting that procedural, impure ML parser to Haskell using simple parser combinators, and then with the Parsec library.
However, since writing that post, I discovered a severe bug in the parser. In this post, I’ll describe the bug, how I fixed it and what I learned in the process. Haskell and Parsec will have to wait for now (tl;dr Parsec is pretty neat, but it’s easy to rely on infinite lookahead instead of carefully writing LL(1) parsers).
Parsing function application
I’ve adapted the prototype I described before into the front-end for an ML compiler I’ve been working on, since at its core ML is syntactically similar (semantically similar as well of course) to the lambda calculus. Recall that the parser should enforce the usual rules of precedence and associativity in arithmetic expressions: 1 + 2 * 3
should be parsed as 1 + (2 * 3)
. This is relatively well-trodden ground, and most compiler texts address this by introducing precedence levels into the grammar.
In a full-fledged programming language like ML, there are other expressions that must have even higher precedence than multiplication, but aren’t usually present in the simple calculator language, for instance function application (or function calls). In ML, function application is simply one expression followed by another: f x
instead of f(x)
as is the case in most languages with syntax derived from C. This poses a problem for a recursive descent parser, since the resulting grammar has a production that looks something like:
1 2 3 4 5 |
|
This was the grammar I wrote in order to add function application to the parser I started with that did arithmetic only, and it grew out of my belief that as a functional, expression oriented language, ML should allow any expression to be called as a function in an application. For instance, the following are all legal ML programs:
1 2 |
|
However, the grammar poses a real problem for the parser writer: the production expr -> expr expr
is left-recursive. This production is a pain to fix, and in fact I wasn’t able to figure out how to remove the recursion, since the usual algorithm breaks the production into two pieces but the result in this case still has a left recursion.
The intention with this grammar is that the following snippet of ML should be parsed as a curried function application, allowing the usual lambda calculus style of supporting multiple arguments as in ML and Haskell:
1 2 3 |
|
What I decided was to attempt to fix the problem of parsing function application by ignoring support for curried functions, and add the following production to the grammar:
1 2 3 4 5 |
|
This grammar no longer has the problem that the previous one has, since exprs
is not left-recursive. The exprs
production can be implemented straightforwardly in the existing recursive descent parser:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
|
On line 13, this code parses a single expression, then proceeds to look ahead one token (via has ()
, which returns true is there are remaining tokens and peek ()
which returns the next token but does not advance the input). It only parses a second expression if that token is in the set FIRST(expr)
(tokens that could possible begin an expr
). If the token is not in FIRST(expr)
, this procedure simply returns the first expression it parsed (the function FIRSTexpr
returns an option
value rather than true or false because it’s returning the position record which keeps track of line and column number for tokens and AST nodes, for error handling, but it’s not important for this discussion).
This technique works, sort of. It does correctly parse function application expressions and doesn’t support curried functions, which is what we expected. However the new grammar is deeply flawed, and here’s a bit of code that shows how wrong it is:
1
|
|
This code should be parsed as:
1
|
|
But because I added the new production to the top of the grammar (by which I mean the top of the precedence hiearchy), this caused function application to have the lowest precedence, rather than the highest, which is the usual precedence assigned to application in ML and most languages I’m aware of with this syntax. So with the parser I had, this code is parsed as:
1
|
|
Pretty bad!
The Definition of Standard ML
I started researching this problem, because it occured to me that it might very difficult or even impossible to parse ML with recursive descent. I didn’t find any evidence that this was the case, and in fact I stumbled upon a paper written by Andrew Appel and David MacQueen about the development of SML/NJ, mentioning that, “Early in the development of the compiler we used a hand-written lexical analyzer and a recursive-descent parser.” This was encouraging, to say the least. I also found a few discussions about ML’s syntax that mentioned features that make it not LL(k), however none of them specifically mentioned function application, instead referencing completely separate issues like distinguishing datatype constructors from variables and dealing with user-defined infix operators.
Eventually I hit upon a few pages in the appendix of The Definition of Standard ML which talk about syntax. I had known about The Definition but had avoided opening it simply because I assumed it would be far too dense and difficult for me to make much use of. I also (mistakenly) assumed it wouldn’t necessarily get into such hairy, pedestrian details of implementation like how to parse source code. As it turns out, not only are some parts of it pretty accessible, but there is a pretty thorough coverage of syntactic issues that doesn’t go as far as defining a grammar suitable for parsing with no modification, but does reveal some interesting properties of ML’s syntax, including how to resolve the problem I was facing with function application.
Appendix B begins with a specification of several classes of phrases that make up Standard ML programs: AtExp
, AppExp
, InfExp
, and Exp
. This distinction is subtle but specifically addresses the issue with the grammar I had put together above. Here is the grammar in the appendix, simplified a little bit for clarity:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
|
There is obviously more to the full grammar, but this snippet illustrates something important, namely that the following program is syntactically illegal in Standard ML:
1
|
|
Essentially, there is a relatively restricted class of expressions that are allowed to be in the function or argument position of an application without being parenthesized: simple expressions like variables, constants, records, selectors (e.g. #foo
), tuples, lists, and let
expressions. All other expressions must be parenthesized to be either applied as functions or passed as arguments to functions.
In hindsight, this makes a lot of sense, since there are some “open-ended” expressions that can’t possibly occur in the function position: if
, while
, case
and fn
all lack terminating end
keywords unlike let
, so if they were allowed to be applied, any expression occuring after them would be ambiguous, and could be part of the body or considered an argument. It’s not as clear to me why the program above is illegal, but requiring both the function an argument to be “atomic” expressions might make for a cleaner grammar.
Adding this separation to my grammar and to my parser wasn’t difficult, and now function applications would be parsed correctly with the highest precedence when mixed with infix operators. Here’s the code for the expr
production:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
|
It peeks one token ahead (case peek () of
on line 3) and then depending on the token found, either parses an if
, fn
, or match
expression (match
in my compiler is the same as case
in SML, and match
in OCaml, the reason for using match
instead of case
is tied up in the work I’ve been doing to compile pattern matching, which deserves a whole separate post), or defers to parse an infix expression with infexp
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 |
|
The code to parse let
expressions is still just as hairy as before (and definitely could or should be improved!) but the important part here is the distinction between appexp
, atexp
, infexp
, and expr
, which represent the same classes of phrases described in The Definition. The result is that this code not only correctly parses function application with higher precedence than all the infix operators, but also correctly parses curried function application (as left-associative).
The lessons I learned here were simple: (a) do some research and (b) sometimes there isn’t a fully general, elegant solution (in this case, the production expr -> expr expr
) and sometimes introducing a level of complexity (the different classes of expression) is necessary.
The complete parser code is here. I hope to follow this post up with another one detailing changes I made to the infix expression parser, and how the technique I ended up using turned out to be a perfect fit for ML in another part of syntax that is hairy to parse with recursive descent.