Previously, I hinted that attaching line and column numbers to a token stream would be a challenge. While it didn’t actually take the four months since I wrote that for me to figure out how to do it, getting it to work without forcing a rewrite of all my stream processing code was tricky. Inevitably, the result is a simple composition of higher order functions, but that belies how much of a pain it was to iron out.
Recall that the primary method of interacting with streams is via readers, which are simply
peek functions with a fancy name:
The first thing we need in order to start tracking line and column numbers is some representation of a position in a file. While a pair would suffice, I think it’s a little clearer to use a record with
col fields, but then wrap that in a module with an abstract type (i.e. with opaque signature ascription). This reduces the clutter when printing streams and elements at the REPL along with attached positions. There are of course many great reasons to use abstract types: modularity, encapsulation, etc… but I’m not that embarrassed to admit that getting the concise printed representation
- is half of why I use them most of the time.
There are a few helpers that will make our lives a bit easier, and also allow consumers of this type to actually do anything useful with it, like print an error message specifying line and column:
1 2 3 4 5 6 7 8 9
Some of these are standard for dealing with abstract types: you need to provide all of the ways a user would construct and manipulate values of this type. In this case we have a
zero value indicating the beginning of a stream, selectors (
line) to get at the fields, applicative updaters (
incrLine) and a
show function. Bog standard stuff, but note along with this that using abstract types rather than “raw” tuples would let us extend this to include a third field like absolute character position without updating as much code.
Finally, the more interesting bits are how we deal with streams and readers. If a position is a value of this new type, what’s the type of a positional stream, one that keeps track of line and column number? And how to implement a reader to go along with it? I settled on a simple, non-opaque representation: a pair of stream and position representing a positional stream, and a pair of element and position for the elements generated by the stream. This code sort of writes itself but there is some subtlety to the reader.
1 2 3 4 5 6 7 8
This function takes a character reader and returns a new reader that works the same way as the first except that it also attaches positions to each character and to the stream. Note that the new reader is responsible for updating the position in the stream, and so must unpack the value produced by the first reader and increment the line number if it’s a new line, otherwise update the column number.
So, now we have a way of constructing positional readers from raw character readers. So far, so good. Things break down a little when we attempt to use our growing body of general stream processing code. Take a look at the type of
skipWS, our example from earlier that strips whitespace off the front of a stream:
This function explicitly takes a raw character reader, and does then its work on the stream. But there’s no way we can use this with a positional reader and positional streams, since the type of element produced by a positional reader is not
char * Pos.t. (where
Pos.t is the type of position, see the full source for the
I struggled with this problem for a little while. It seemed at first like there was no easy solution that would make it so you didn’t have to have a whole seperate suite of “general” functions that operated only on positional streams. After a while though, I realized that the solution was relatively simple, I had just been looking at it the wrong way. Rather than try to write a generic
skipWS function that operated on both raw and positional streams, I should have been trying to find a way to parameterize the
skipWS function in terms of the type of stream. This is something I’m already doing in a lot of stream processing code, since almost any function operating on streams needs access to the appropriate reader for that type of stream.
Take a look at the classic function
dropWhile, re-written to work with streams:
1 2 3 4
This function depends on access to a
rdr which decomposes the stream to it can be filtered and truncated. Here’s
skipWS written in terms of
The secret was to realize that
Char.isSpace is the culprit. It operates on raw characters, when we need it to operate on characters plus positions. This should really be a parameter to
skipWS, letting the user decide what kind of stream they are stripping whitespace from. One could even imagine a situation where they weren’t operating on characters at all, but some other type with its own notion of what whitespace means.
At this point, the parameterization of these functions gets out of hand:
Personally, I think two arguments like this is the point where I reach for functors:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
I’m not too sure about the names of all these modules, but this is at least a start. I’m planning to experiment with this approach while writing a little parser that reports errors in as nice a way as possible, and continue experimenting with this as I begin to integrate something hopefully a little smoother around the edges into my ML compiler, for eventual use in reporting not only syntax errors but also type errors.