Previously, I hinted that attaching line and column numbers to a token stream would be a challenge. While it didn’t actually take the four months since I wrote that for me to figure out how to do it, getting it to work without forcing a rewrite of all my stream processing code was tricky. Inevitably, the result is a simple composition of higher order functions, but that belies how much of a pain it was to iron out.
Recall that the primary method of interacting with streams is via readers, which are simply peek
functions with a fancy name:
1
|
|
positions
The first thing we need in order to start tracking line and column numbers is some representation of a position in a file. While a pair would suffice, I think it’s a little clearer to use a record with line
and col
fields, but then wrap that in a module with an abstract type (i.e. with opaque signature ascription). This reduces the clutter when printing streams and elements at the REPL along with attached positions. There are of course many great reasons to use abstract types: modularity, encapsulation, etc… but I’m not that embarrassed to admit that getting the concise printed representation -
is half of why I use them most of the time.
1
|
|
There are a few helpers that will make our lives a bit easier, and also allow consumers of this type to actually do anything useful with it, like print an error message specifying line and column:
1 2 3 4 5 6 7 8 9 |
|
Some of these are standard for dealing with abstract types: you need to provide all of the ways a user would construct and manipulate values of this type. In this case we have a zero
value indicating the beginning of a stream, selectors (col
, line
) to get at the fields, applicative updaters (incrCol
, incrLine
) and a show
function. Bog standard stuff, but note along with this that using abstract types rather than “raw” tuples would let us extend this to include a third field like absolute character position without updating as much code.
positional streams
Finally, the more interesting bits are how we deal with streams and readers. If a position is a value of this new type, what’s the type of a positional stream, one that keeps track of line and column number? And how to implement a reader to go along with it? I settled on a simple, non-opaque representation: a pair of stream and position representing a positional stream, and a pair of element and position for the elements generated by the stream. This code sort of writes itself but there is some subtlety to the reader.
1 2 3 4 5 6 7 8 |
|
This function takes a character reader and returns a new reader that works the same way as the first except that it also attaches positions to each character and to the stream. Note that the new reader is responsible for updating the position in the stream, and so must unpack the value produced by the first reader and increment the line number if it’s a new line, otherwise update the column number.
tricky bits
So, now we have a way of constructing positional readers from raw character readers. So far, so good. Things break down a little when we attempt to use our growing body of general stream processing code. Take a look at the type of skipWS
, our example from earlier that strips whitespace off the front of a stream:
1
|
|
This function explicitly takes a raw character reader, and does then its work on the stream. But there’s no way we can use this with a positional reader and positional streams, since the type of element produced by a positional reader is not char
but char * Pos.t
. (where Pos.t
is the type of position, see the full source for the Pos
module)
I struggled with this problem for a little while. It seemed at first like there was no easy solution that would make it so you didn’t have to have a whole seperate suite of “general” functions that operated only on positional streams. After a while though, I realized that the solution was relatively simple, I had just been looking at it the wrong way. Rather than try to write a generic skipWS
function that operated on both raw and positional streams, I should have been trying to find a way to parameterize the skipWS
function in terms of the type of stream. This is something I’m already doing in a lot of stream processing code, since almost any function operating on streams needs access to the appropriate reader for that type of stream.
Take a look at the classic function dropWhile
, re-written to work with streams:
1 2 3 4 |
|
This function depends on access to a rdr
which decomposes the stream to it can be filtered and truncated. Here’s skipWS
written in terms of dropWhile
:
1
|
|
The secret was to realize that Char.isSpace
is the culprit. It operates on raw characters, when we need it to operate on characters plus positions. This should really be a parameter to skipWS
, letting the user decide what kind of stream they are stripping whitespace from. One could even imagine a situation where they weren’t operating on characters at all, but some other type with its own notion of what whitespace means.
At this point, the parameterization of these functions gets out of hand:
1
|
|
Personally, I think two arguments like this is the point where I reach for functors:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
|
I’m not too sure about the names of all these modules, but this is at least a start. I’m planning to experiment with this approach while writing a little parser that reports errors in as nice a way as possible, and continue experimenting with this as I begin to integrate something hopefully a little smoother around the edges into my ML compiler, for eventual use in reporting not only syntax errors but also type errors.