What I cannot create, I do not understand.

Just because you've implemented something, doesn't mean you understand it.

Positional Lexers

… in ML, of course

In my last post, for some reason I decided to drop “in ML” from the title, unlike previous posts. I’m not sure why I did that, or whether I like it, since a lot of these blog posts and accompanying code is kind of ML specific (and I am awfully fond of ML), but at the same time I’d like to think these posts are more generally useful to non-ML programmers interested in amateur compiler hacking. For now, I guess “in ML” will just be implicit, until I start posting about work in another language.

positional lexical analyzers

Moving on, I wanted to address something I neglected to cover in my last post, which is the process of turning a positional character reader into a lexical analyzer that produces a stream of tokens with their position in file. This is the second prerequisite after positional character readers to actually handling and reporting syntactic errors (and other errors, both static and dynamic).

what’s the point of all this

I’ve been steaming along with these compiler front-end posts without a lot of higher level commentary, for instance never addressing why I might be bothering with any of this stuff, or where the hell I’m going with it all. So here’s a brief rundown of my motivations and where I want to go in the next half dozen or so blog posts.

Last summer (while on a mini-sabattical at Hacker School) I wrote most of a front end for an ML compiler, nicknamed maml. It consists of a lexer, parser, type checker (with Hindley-Milner type inference), and a naive pattern match compiler. Eventually, I hope it will become a full blown implementation of a mini ML-like language, not all the way to a real Standard ML but a bit more than a toy interpreter. One of my goals for this project was always to explore aspects of compiler construction I’ve avoided, and one of those that I wanted to understand better was parsing.

The parser in maml is a hand-written recursive descent parser (with an embedded Pratt-style expression parser, which deserves its own post to be honest). Right now the parser does a poor job reporting syntax errors, and the same goes for the type checker reporting type errors (and for that matter the pattern match compiler reporting non-exhaustive match warnings). Exploring how to do a good job of handling and reporting errors is where I am right now. Part of that involved refactoring the front end to deal in streams instead of the naive approach of handling lists, and so that’s why I’ve focused on streams in these posts so much. Handling and reporting errors during parsing will be the next post I write, after this one. Eventually I’ll find the time to return to maml and impart what I’ve found into a refactor of the parser, and eventually the type checker.

To put it succinctly, this is what I want my parser to do when encountering bad input:

1
2
3
Syntax error:
val x = if y then z
~~~~~~~~~~~~~~~~~~~~^ expected 'else'

back to the lexer

Ok, with that out of the way, I hope that the thread is clearer and that the end goals better defined. Recall that previously we had a character reader (refer back to this post for background on readers) that produced characters with attached column and line numbers:

1
val reader: (char * Pos.t,'a * Pos.t) StringCvt.reader = (* ... *)

Where the type Pos.t is:

1
type t = {col: int, line: int}

The Pos module defines a small suite of related functions for operating on values of this position type. The plan is to write a function that, given a positional character reader, returns a token reader that attaches line and column numbers to each token in the stream. For a suitable token type Lexer.t, we want:

1
val makeLexer: (char * Pos.t,'a * Pos.t) StringCvt.reader -> (Lexer.t * Pos.t,'a * Pos.t) = (* ... *)

That is to say, the function returned produces positional tokens from a positional stream of type 'a.

defining a token type

Since it’s a nice small example, we’ll steal the Scheme token type I used in a previous post on lexers and streams. Here it is again:

1
datatype token = LParen | RParen | Atom of string

I’ve removed the Dot constructor, for reasons that may become clear in a future post. For now trust me that it turns out to unnecessarily complicate things in parsing, and I want to use this example as the foundation for posts on parsing.

Specific values of this type will need to carry additional data: line and column numbers, or positions. There are two ways you could do this. One way would be for your lexer to produce pairs of token and position, rather than just tokens. The other is to modify this type so that each constructor accepts another argument, the position. During lexing, this matters less, but it turns out that in parsing the second is much easier, I think. This problem in general isn’t entirely trivial.

Since we have existing code that treats positional readers as those that produce pairs of element and position, for the lexer we’ll go with this. The parser will likely use the other approach.

the actual lexer

The makeLexer function isn’t that difficult to write. It doesn’t differ much from the previous lexer (although keep in mind we aren’t dealing with dotted pairs any more), and the biggest change is really just the bookkeeping of using the right position at the time of token construction:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
fun makeLexer (rdr : (char * Pos.t, 'a) reader) : (token * Pos.t, 'a) reader =
    let
       fun skipWS rdr s =
           case rdr s of
               NONE => s
             | SOME ((#" ", _), s') => skipWS rdr s'
             | SOME ((_   , _), s') => s
    in
       fn s =>
          case rdr (skipWS rdr s) of
              NONE => NONE
            | SOME ((#"(", p), s') => SOME ((LParen, p), s')
            | SOME ((#")", p), s') => SOME ((RParen, p), s')
            | SOME ((_, p), s') =>
              case getAtom (rdr, skipWS rdr s) of
                  NONE => NONE
                | SOME (atom, s') => SOME ((Atom atom, p), s')
    end

The function getAtom that extracts an atom as a string from the stream is largely unchanged, but there is again some extra bookkeeping, in this case updating the pattern matching to just ignore the position that’s attached to every character.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
fun getAtom (rdr : ((char * Pos.t), 'b) reader, s : 'b) : (string * 'b) option =
    let
       fun done [] _ = NONE
         | done acc s = SOME (String.implode (rev acc), s)

       fun getAtom' acc s =
           case rdr s of
                NONE => done acc s
              | SOME ((#"(", _), rest) => done acc s
              | SOME ((#")", _), rest) => done acc s
              | SOME ((x,    _), rest) =>
                if Char.isSpace x then
                   done acc s
                else getAtom' (x :: acc) rest
    in
       getAtom' [] s
    end

Along with the supporing infrastructure, we can now use this code to tokenize any stream (with a suitable reader implementation) into tokens with character position attached to them. For instance, given a regular character reader that operates on strings (this is inefficient, but fine for our purposes):

1
2
3
4
fun getc1 "" = NONE
  | getc1 s  = SOME (String.sub (s, 0), String.substring (s, 1, size s - 1))

val SOME (#"f", "oo") = getc1 "foo"

And a positional reader constructed from our string reader:

1
2
3
4
5
6
7
8
9
val getc2 = Pos.reader getc1

val SOME ((#"f", p), ("oo", p')) = getc2 (Pos.stream "foo")

val 1 = Pos.line p
val 0 = Pos.col p

val 1 = Pos.line p'
val 1 = Pos.col p'

We can finally construct a lexer, that attaches line and column numbers to tokens that it produces. This is then used by the parser to report errors, as well as attach the position of various syntax elements to the abstract syntax tree. I’ll try to cover that in my next post.

1
2
3
4
5
val lex   = Lexer.makeLexer getc2

val SOME ((LParen, p), ("foo)", p')) = lex (Pos.stream "(foo)")
val 1 = Pos.line p
val 0 = Pos.col p