Lexical analysis with readers and streams in ML - What I cannot create, I do not understand.

In my last post I described an interesting abstraction for streams in Standard ML, that I “discovered” after some trial and error. I say “discover” with scare quotes because I didn’t exactly make a revolutionary contribution; the documentation in the Basis describes this idea in detail and presents a few examples (see “Discussion” here). But it surprised me when I learned about it, so I thought it would be interesting to other people. Assuming that it is, I thought it might be helpful to walk through the small example I’ve been using as a proof of concept while experimenting with this code.

Before publishing the previous post, I wrote a lexer and a parser for s-expressions, using readers and streams. I chose this because it’s a very small language that’s easy to parse with recursive descent, but it’s also complicated enough that the code illustrates some of the issues you’d encounter writing larger parsers.

crash course on s-expressions

To refresh your memory (or introduce you if you haven’t used Lisp or Scheme) s-expressions are the textual representation of Lisp and Scheme source code and list literals. The subtle relationship between code and data is where Lisp draws much of its metaprogramming power, but that’s quite far from the scope of this post. If you’re not familiar with them, it’s safe to think of s-expressions as a lightweight notation for nested lists:

                
                      (1 2 (3 4) 5 (6 (7) 8) 9 0)

Just keep in mind that they can also be used to represent Lisp and Scheme code:

some Scheme code

                
                      (define (fact n)
  (if (= n 0) 1
      (* n (fact (- n 1)))))

To make it a bit more interesting, I decided to implement another feature of s-expressions: “dotted pairs” or “improper lists”. In Lisp, lists are constructed using the familar cons function, which can be used the same way it is in ML. In ML the type of cons (i.e. ::) is 'a * 'a list -> 'a list, and the compiler checks this for us. In Lisp, without static types, cons can be used the same way, and usually is: the first argument is something atomic, and the second argument is a list. The empty list or nil denotes the end of the list, just as in ML. But it’s possible to use it to construct arbitrary pairs, and for the second element to be atomic too, in which case that pair is printed with a dot:

                
                      > (cons 1 (cons 2 nil))
(1 2)
> (cons 1 2)
(1 . 2)
> (cons 1 (cons 2 (cons 3 4)))
(1 2 3 . 4)

reader

Recall that a reader is a peek function over a stream; it takes a stream and returns an element and new stream (wrapped in 'a option). Concretely, the type of a reader, as defined in StringCvt is:

                
                      type ('a,'b) reader = 'b -> ('a * 'b) option

…where 'b is the type of stream and 'a is the type of element in the stream. For instance, the type of reader for characters and strings is string -> (char * string) option.

tokens

Lexical analysis is the process of breaking input into words or tokens. It can also be thought of as assembling a higher level object (tokens) from a lower level one (characters). Lexers usually also remove comments and normalize whitespace, as appropriate for the programming language in question.

Our lexer will be a function named tokenize that operates on readers. Its argument is a character reader and it will return a token reader. The argument produces characters from some type of stream, and the returned reader will produce tokens from the same type of stream:

tokenize_type

                
                      val tokenize : (char, 'a) StringCvt.reader -> (token, 'a) StringCvt.reader

There are only four kinds of tokens in the little s-expression language: left (opening) parentheses, right (closing) parentheses, atoms, and a period (for dotted pairs and improper lists).

token_datatype

                
                      datatype token = LParen | RParen | Atom of string | Dot

Three of the tokens are single characters, while atoms are the only token longer than one character. The reader returned by tokenize will need to peek at least one character ahead in the stream to determine what to do. If that character is one of the single character tokens, all it needs to do is return that token along with the rest of the stream. If the character is some other non-whitespace character, it should read ahead until seeing some character not part of the atom, i.e. one of the other tokens or whitespace. Finally, it should just skip over the whitespace between tokens.

whitespace

Stripping whitespace from streams, with readers, is something that nearly every character and string processing program will want to do, so there’s a function already provided in the StringCvt module for just this purpose: skipWS. It was helpful for me to understand how to implement this myself, just to get a better intuition for streams and readers:

                
                      fun skipWS rdr s =
    case rdr s of
        NONE => s
      | SOME (x, s') =>
        if Char.isSpace x then
           skipWS rdr s'
        else s

This function is an example of a more general pattern for readers and streams: it’s primarily a function on streams, but since it needs a particular reader to work with any particular stream, it takes a reader as the first argument, a stream as the second argument, and it returns a stream. It discards characters from the stream with the provided reader until it sees one that is not whitespace, returning the stream at that point. For example:

                
                      - skipWS Reader.string "   foo" ;
val it = "foo" : string

Here, Reader.string is a string reader, i.e. a value of type (char,string) StringCvt.reader. The function skipWS will work with any type of character stream, whether backed by strings, substrings, lists, or files:

                
                      - skipWS Reader.list [#" ",#" ",#" ",#"f",#"o",#"o"] ;
val it = [#"f",#"o",#"o"] : char list
- Substring.string (skipWS Reader.substring (Substring.full "   foo")) ;
val it = "foo" : string

- let val out = TextIO.openOut "scratch"
      val _ = TextIO.output (out, "   foo")
      val _ = TextIO.flushOut out
      val fromFile : string -> TextIO.StreamIO.instream =
          TextIO.getInstream o TextIO.openIn
      val inp = fromFile "scratch"
in
   TextIO.StreamIO.inputAll (skipWS Reader.streamIO inp)
end ;

val it = ("foo",-) : TextIO.StreamIO.vector * ?.TextIO.instream

Again, Reader.list, Reader.substring, Reader.streamIO are all just readers for various concrete stream types. You can see all of the Reader struct in the source for this post. I’m tempted to build up a small structure with general reader-based stream processing code in the vein ofskipWS. But the implementations of the readers themselves are actually scattered throughout the Basis, for instance Substring.getc is actually a reader. If you’re just messing around, you can use those.

tokenize

With this under our belt, we can move on to writing the lexer proper:

tokenize_impl

                
                      fun tokenize rdr =
    fn s =>
       case rdr (StringCvt.skipWS rdr s) of
           NONE => NONE
         | SOME (#".", s') => SOME (Dot, s')
         | SOME (#"(", s') => SOME (LParen, s')
         | SOME (#")", s') => SOME (RParen, s')
         | SOME (_, s') =>
           case getAtom rdr (StringCvt.skipWS rdr s) of
               NONE => NONE
             | SOME (atom, s') => SOME (Atom atom, s')

Remember that this function takes a reader, and returns another reader. Since readers are functions that operate on streams, tokenize returns another function, so the body is just a lambda expression (fn s => on line 2).

The first thing we do is apply the skip whitespace function from StringCvt to the stream, then apply the reader (rdr (StringCvt.skipWS rdr s) on line 3). This gives us either NONE, indicating the stream is empty, or SOME with the peeked character and the rest of the stream. Note that since we remove the leading whitespace, we know this character is not a space. We either return a single character token or call a helper function named getAtom, which takes a reader and a character stream, and returns the next atom in the stream as a string, and the rest of the character stream.

This is another example of a common pattern, in this case for stream processing functions that take a reader and return another reader. The structure of the branches of the case expression is the important takeaway. If we try to peek into the first element of the stream and find nothing (i.e. NONE) then we should just return NONE, because the stream is empty. If there is an element in the stream, we transform that element somehow (in this case converting it to a token), and return the result of the transformation and the rest of the stream, wrapped in SOME.

getAtom_impl

                
                  
                    
                    
                      fun getAtom rdr s =
    let
       fun return [] _ = NONE
         | return acc s = SOME (String.implode (rev acc), s)

       fun getAtom' acc s =
           case rdr s of
                NONE => return acc s
              | SOME (#"(", rest) => return acc s
              | SOME (#")", rest) => return acc s
              | SOME (x, rest) => if Char.isSpace x then
                                     return acc s
                                  else getAtom' (x :: acc) rest
    in
       getAtom' [] s
    end

                    
                  

              

The function getAtom reads characters from the stream until it sees either a parentheses or a space. This is closer to the skipWS function than to tokenize, in terms of the shape of the code. Instead of discarding characters as skipWS does, it collects them and returns them as a string, along with the rest of the stream.

Together, these functions complete the lexical analyzer. Using a little helper function consume that repeatedly peeks from a stream and accumulates the elements into a list, we can run some ad-hoc tests:

tests

                
                      val [LParen, Atom "foo", RParen] = consume (tokenize string) "(foo)"
val [LParen, Atom "foo", Atom "bar", RParen] = consume (tokenize string) "(foo bar)"
val [LParen, Atom "foo", Dot, Atom "bar", RParen] = consume (tokenize string) "(foo . bar)"

That concludes this walkthrough. I hope it’s helpful, and demystifies lexical analysis a little bit (convincing you that you need neither ANTLR nor the Dragon book to write a little lexer) and also elaborates on the reader abstraction and how to make use of it. The code in this post, along with supporting boilerplate, is on Github, both as an Org-mode/Babel file and a tangled code-only .sml source file. If this whets your appetite, you could try implementing comments, which should be straightforward, or attach line and column number to the tokens, which is much harder. I may cover those in a later post, and later still cover the parser.