For a while now, I’ve been meaning to refactor my ML compiler’s front-end in order to support reading source code from files instead of only from strings in memory. Right now, the type of the lexical analyzer is
string -> token list (a function from strings to lists of tokens) and I hoped it would be possible to change this to
char stream -> token stream (a function from streams of characters to streams of tokens), where the
'a stream type would be some abstraction over sequences, such as strings but also file handles. Then the lexer could ignore the underlying details of the input, treating string input (for example in unit tests) the same way as file I/O (for example when compiling user programs). If they could both use the same type of stream (i.e. both instances of some type
'a stream), that would be sort of nice and elegant, plus allow the parser (which would consume a
token stream) to share any general stream processing code I wrote for the lexer. So I started to look for a stream library for Standard ML.
The Standard ML Basis has a functional I/O library based around streams. This module, TextIO.StreamIO, has a signature that should be quite natural for most functional programmers to use, and supports the kind of higher-order functions you might be used to using on lists, for instance
filter, both common somewhat operations on input, which can be easily implemented to work with the StreamIO interface:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
There are several problems with this code. First of all
filter should really return another stream, but in this case it returns consumes all the input from the stream and returns it as a string. This could be pretty bad if the file was large, but even leaving issues of efficiency aside, it’s simply a bad interface. Ideally it should be possible to apply transformations like
filter to streams and pass the resulting streams to other functions that can consume the streams, oblivious to the filtering.
Furthermore, we’re restricted to streams of characters, so if I used this for the input stream, the lexer would have to return some other stream type, not a stream based on the TextIO.StreamIO module. It may be possible to construct, from lower level pieces of the primitive I/O modules, a value of type
TextIO.StreamIO.instream that consists of elements that are not characters, however after reading through the documentation and encountering this monster record, I realized this module is very I/O focused and while it might be possible it would be really hairy. Not yet discouraged, I decided to explore other options for a custom stream type that was polymorphic in its element, but still relatively simple.
Unfortunately, after some experimentation, I realized this is a more challenging task than I initially thought. While it certainly may be possible with some type gymnastics (I’m generally pretty conservative when claiming something is impossible in ML, after seeing the kinds of type hackery exhibited on the MLton wiki) it isn’t clear to me how to structure the code such that it passes the type checker. The problem seems to be that because I/O streams and strings would always produce values with concrete types (characters or bytes), there’s an inherent disagreement about the type of functions that operate on streams. For example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
1 2 3 4
The type of
Stream.peek can’t be polymorphic since for some values it always returns characters. There are other, related problems with this code, specifically values constructed with
Stream.String are polymorphic, which becomes complicated by the value restriction.
The STREAM signature is inspired by a stream module from the OCaml standard library. Obviously, this module does type check in OCaml. How the OCaml version works is not clear to me, but after discovering multiple calls to
Obj.magic (whose type is
'a -> 'b) in the OCaml implementation, specficially applied to the value read from the actual I/O stream, I realized that there might be a hole punched in the type system for this to work, and I clearly don’t know enough OCaml to understand it.
At some point in my search, I had a breakthrough when I stumbled onto the StringCvt module in the Standard ML Basis. It’s an innocuously named module related to converting strings into values, used for instance by
Int.scan. However, what is lexical analysis and parsing but converting strings (or char streams) into values? It turns out that there is an extremely simple type alias in this module that is a very solid foundation for a larger body of stream processing code: the lowly
('a, 'b) reader. This is just a name for the type
'b -> ('a * 'b) option, defined in code very simply as just:
From the description: “The type of a reader producing values of type ‘a from a stream of type ‘b. A return value of SOME(a,b) corresponds to a value a scanned from the stream, plus the remainder b of the stream. A return value of NONE indicates that no value of the correct type could be scanned from the prefix of the stream.”
This should be quite clear, but let me reiterate: A value of type
reader is not a stream, but a function of one argument, which is a stream. If there are no more elements left in the stream, the return value is
NONE. If there are elements, the return value is
SOME (x, s'), where
x is the next element in the stream, and
s' is the rest of the stream after consuming
The insight is that this allows you to more easily compose versions of higher-order functions like map, filter and friends that operate on streams.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
The type of
map for streams is slightly different than the usual
map over lists. Both of them take a transformation function with the type
'a -> 'b, transforming values of type
'b. The regular
map over lists then takes a list of
'a and returns a list of
'b. You might think that
map over streams would take a stream of
'a and return a stream of
'b but with the reader abstraction the type is a little bit different: instead it takes a reader producing values of type
'a from a stream of type
'c and returns a reader that produces values of type
'b from a stream of type
'c. So it’s a function that operates on readers rather than streams, but this turns out to work quite well, I think. Filter works the same way.
1 2 3 4 5
I’ve started to write a few small lexers and parsers using readers, and I think that it’s a nice way to abstract over the possible input values that string and character processing programs consume. Of course, good sequence abstractions abound in programming languages and libraries, and exploring other approaches to tackling this problem is still an open question for me to explore further. For instance, while I’m somewhat familiar with Clojure’s
seq protocol, I do not have a strong intuition for all of the details. It would be interesting to see how Haskell in particular deals with this, as it’s closer to ML, and of course understanding OCaml’s implementation is high on my list. Implementing lazy streams using the reader another thing I have yet to try. Clearly, I’m only just scratching the surface, but in the mean time, I’m ready to move on with my refactoring.