What I cannot create, I do not understand.

Just because you've implemented something, doesn't mean you understand it.

Polymorphic streams in ML

For a while now, I’ve been meaning to refactor my ML compiler’s front-end in order to support reading source code from files instead of only from strings in memory. Right now, the type of the lexical analyzer is string -> token list (a function from strings to lists of tokens) and I hoped it would be possible to change this to char stream -> token stream (a function from streams of characters to streams of tokens), where the 'a stream type would be some abstraction over sequences, such as strings but also file handles. Then the lexer could ignore the underlying details of the input, treating string input (for example in unit tests) the same way as file I/O (for example when compiling user programs). If they could both use the same type of stream (i.e. both instances of some type 'a stream), that would be sort of nice and elegant, plus allow the parser (which would consume a token stream) to share any general stream processing code I wrote for the lexer. So I started to look for a stream library for Standard ML.
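To make the goal concrete, here is a rough sketch of the interface I had in mind; the names are placeholders, and 'a stream is exactly the abstraction I went looking for:

(* Hypothetical sketch of the lexer interface I was aiming for.
 * 'a stream stands for the sequence abstraction described above. *)
signature LEXER =
sig
   type 'a stream
   type token
   val lexString : string -> token list      (* what I have today *)
   val lex : char stream -> token stream     (* what I would like *)
end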

Functional I/O

The Standard ML Basis has a functional I/O library based around streams. This module, TextIO.StreamIO, has a signature that should feel quite natural to most functional programmers, and supports the kind of higher-order functions you might be used to using on lists, for instance takeWhile and filter, both fairly common operations on input, which can easily be implemented on top of the StreamIO interface:

(* Construct I/O streams from files ... *)
val fromFile : string -> TextIO.StreamIO.instream =
    TextIO.getInstream o TextIO.openIn

(* ... and from strings *)
val fromStr : string -> TextIO.StreamIO.instream =
    TextIO.getInstream o TextIO.openString

 local
    structure S = TextIO.StreamIO
 in

 (*
  * Reads chars from the stream s while p x is true
  *)
 fun takeWhile p s =
     let
        fun takeWhile' acc s =
            case S.input1 s of
                NONE => (String.implode (rev acc), s)
              | SOME (x, s') =>
                if p x then
                   takeWhile' (x :: acc) s'
                else (String.implode (rev acc), s)
     in
        takeWhile' [] s
     end

 (*
  * Filters stream s by predicate p
  *)
 fun filter p s =
     let
        fun filter' acc s =
            case S.input1 s of
                NONE => (String.implode (rev acc), s)
              | SOME (x, s') =>
                if p x then
                   filter' (x :: acc) s'
                else filter' acc s'
     in
        filter' [] s
     end
 end

 val s = fromStr "fooBar"
 val ("foo", s') = takeWhile Char.isLower s

 val t = fromStr "aBcDeF"
 val ("ace", t') = filter Char.isLower t

There are several problems with this code. First of all, filter should really return another stream, but here it consumes all the input from the stream and returns it as a string. This could be pretty bad if the file were large, but even leaving efficiency aside, it’s simply a bad interface. Ideally it should be possible to apply transformations like filter to streams and pass the resulting streams on to other functions that consume them, oblivious to the filtering.

Furthermore, we’re restricted to streams of characters, so if I used this for the input stream, the lexer would have to return some other stream type, not one based on the TextIO.StreamIO module. It may be possible to construct, from the lower-level pieces of the primitive I/O modules, a value of type TextIO.StreamIO.instream whose elements are not characters. However, after reading through the documentation and encountering this monster record, I realized that this module is very much I/O focused, and while it might be possible, it would be really hairy. Not yet discouraged, I decided to explore other options for a custom stream type that was polymorphic in its elements, but still relatively simple.

Naive streams

Unfortunately, after some experimentation, I realized this is a more challenging task than I initially thought. While it certainly may be possible with some type gymnastics (I’m generally pretty conservative about claiming something is impossible in ML, after seeing the kinds of type hackery exhibited on the MLton wiki), it isn’t clear to me how to structure the code so that it passes the type checker. The problem seems to be that because I/O streams and strings always produce values with concrete types (characters or bytes), there’s an inherent disagreement about the type of functions that operate on streams. For example:

(* This code does not compile *)
signature STREAM =
sig
   type 'a t
   val fromList : 'a list -> 'a t
   val fromIO : TextIO.StreamIO.instream -> char t
   val fromString : string -> char t
   val peek : 'a t -> ('a * 'a t) option
end

structure Stream : STREAM =
struct
   datatype 'a t = String of int * string
                 | List of 'a list
                 | IO of TextIO.StreamIO.instream
   val fromList = List
   val fromIO = IO
   fun fromString s = String (0, s)
   fun peek (String (idx, str)) =
       if idx < (String.size str) then
          SOME (String.sub (str, idx), String (idx + 1, str))
       else NONE
     | peek (List []) = NONE
     | peek (List (x::xs)) = SOME (x, List xs)
     | peek (IO ins) =
       case TextIO.StreamIO.input1 ins of
           NONE => NONE
         | SOME (x, ins') => SOME (x, IO ins')
end
stdIn:19.1-37.4 Error: value type in structure doesn't match signature spec
    name: peek
  spec:   'a ?.Stream.t -> ('a * 'a ?.Stream.t) option
  actual: char ?.Stream.t -> (char * char ?.Stream.t) option

The type of Stream.peek can’t be polymorphic, since for some values it always returns characters. There are other, related problems with this code: values constructed with Stream.IO and Stream.String are fully polymorphic in their element type, no matter what they actually contain, which also gets complicated by the value restriction.
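To see that second problem in isolation, here is a small sketch using just the datatype from above without the signature (which does compile): the IO constructor never mentions 'a, so an I/O-backed value can claim any element type at all, and because applying fromIO is a function application rather than a constructor application, its result can’t be generalized:

(* The datatype alone typechecks, which is exactly the problem *)
datatype 'a t = String of int * string
              | List of 'a list
              | IO of TextIO.StreamIO.instream

val ins = TextIO.getInstream (TextIO.openString "abc")

(* IO ins : 'a t, so nothing stops us from calling it an int stream *)
val bogus : int t = IO ins

(* fromIO ins is a function application, not a syntactic value, so the
 * value restriction leaves 'a ungeneralized (SML/NJ warns here) *)
val fromIO = IO
val stuck = fromIO ins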

The STREAM signature is inspired by a stream module from the OCaml standard library. Obviously, that module does type check in OCaml. How the OCaml version works is not clear to me, but after discovering multiple calls to Obj.magic (whose type is 'a -> 'b) in the OCaml implementation, specifically applied to the value read from the actual I/O stream, I realized that a hole might have been punched in the type system for this to work, and I clearly don’t know enough OCaml to understand it.

Reader

At some point in my search, I had a breakthrough when I stumbled onto the StringCvt module in the Standard ML Basis. It’s an innocuously named module related to converting strings into values, used for instance by Int.fmt and Int.scan. However, what are lexical analysis and parsing but converting strings (or char streams) into values? It turns out that there is an extremely simple type alias in this module that makes a very solid foundation for a larger body of stream processing code: the lowly ('a, 'b) reader. This is just a name for the type 'b -> ('a * 'b) option, defined in code simply as:

type ('a,'b) reader = 'b -> ('a * 'b) option

From the description: “The type of a reader producing values of type 'a from a stream of type 'b. A return value of SOME(a,b) corresponds to a value a scanned from the stream, plus the remainder b of the stream. A return value of NONE indicates that no value of the correct type could be scanned from the prefix of the stream.”

This should be quite clear, but let me reiterate: A value of type reader is not a stream, but a function of one argument, which is a stream. If there are no more elements left in the stream, the return value is NONE. If there are elements, the return value is SOME (x, s'), where x is the next element in the stream, and s' is the rest of the stream after consuming x.
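The Basis itself already works this way: Int.scan is written against the reader interface, and StringCvt.scanString feeds it a character reader built over a string. For example:

(* Int.scan is a reader transformer: given a character reader, it returns
 * a reader producing ints. StringCvt.scanString supplies a character
 * reader over a string and extracts the first scanned value. *)
val SOME 42 = StringCvt.scanString (Int.scan StringCvt.DEC) " 42 rest"

(* The same scanner works over substrings via Substring.getc *)
val SOME (42, rest) = Int.scan StringCvt.DEC Substring.getc (Substring.full "42abc")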

The insight is that this allows you to more easily compose versions of higher-order functions like map, filter and friends that operate on streams.

structure Reader =
struct

local
   open String
   open StringCvt
in

val list : ('a, 'a list) reader =
 fn [] => NONE
  | (x::xs) => SOME (x, xs)

val string : (char, string) reader =
 fn "" => NONE
  | s => SOME (sub (s, 0), substring (s, 1, size s - 1))


val instream : (char, TextIO.StreamIO.instream) reader =
    TextIO.StreamIO.input1

fun map (f : 'a -> 'c) (rdr : ('a, 'b) reader) : ('c, 'b) reader =
    fn s =>
       case rdr s of
           NONE => NONE
         | SOME (x, s') => SOME (f x, s')

fun filter (p : 'a -> bool) (rdr : ('a, 'b) reader) : ('a, 'b) reader =
    let
       fun rdr' s =
           case rdr s of
               NONE => NONE
             | SOME (x, s') =>
               if p x then
                  SOME (x, s')
               else rdr' s'
    in
       rdr'
    end
end

end

The type of map for streams is slightly different than the usual map over lists. Both of them take a transformation function with the type 'a -> 'b, transforming values of type 'a to 'b. The regular map over lists then takes a list of 'a and returns a list of 'b. You might think that map over streams would take a stream of 'a and return a stream of 'b but with the reader abstraction the type is a little bit different: instead it takes a reader producing values of type 'a from a stream of type 'c and returns a reader that produces values of type 'b from a stream of type 'c. So it’s a function that operates on readers rather than streams, but this turns out to work quite well, I think. Filter works the same way.

val toUpper = Reader.map Char.toUpper Reader.string
val SOME (#"F", "oo") = toUpper "foo"

val upperOnly = Reader.filter Char.isUpper Reader.string
val SOME (#"B", "ar") = upperOnly "fooBar"

What’s next

I’ve started to write a few small lexers and parsers using readers, and I think it’s a nice way to abstract over the possible inputs that string and character processing programs consume. Of course, good sequence abstractions abound in programming languages and libraries, and other approaches to this problem remain open for me to explore. For instance, while I’m somewhat familiar with Clojure’s seq protocol, I don’t have a strong intuition for all of its details. It would be interesting to see how Haskell in particular deals with this, as it’s closer to ML, and of course understanding OCaml’s implementation is high on my list. Implementing lazy streams on top of readers is another thing I have yet to try (a rough sketch of one possible shape follows below). Clearly, I’m only just scratching the surface, but in the meantime, I’m ready to move on with my refactoring.
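(As a teaser for that last experiment, here is a rough, untested sketch of one shape a thunk-based lazy stream built on a reader might take; the names are mine, not from any library.)

(* Hypothetical sketch: a lazy stream as a thunk that, when forced,
 * either ends or yields an element plus the rest of the stream. *)
datatype 'a cell = Nil
                 | Cons of 'a * 'a stream
withtype 'a stream = unit -> 'a cell

(* Build a lazy stream by repeatedly applying a reader to a stream value *)
fun fromReader (rdr : ('a, 'b) StringCvt.reader) (s : 'b) : 'a stream =
    fn () =>
       case rdr s of
           NONE => Nil
         | SOME (x, s') => Cons (x, fromReader rdr s')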