Lecture 9 — 2015-09-28

Extending lambda calculus interpreters; lexing

This lecture is written in literate Haskell; you can download the raw source.

  • Whereas, the homeworks are hard and long,
  • Whereas, some of the homework material isn’t taught until the day before it’s due,
  • Whereas, the class desires it,
  • Be it resolved that homeworks (from HW05 on) are due on the Sunday after their release.

As mentioned in class, this is not an invitation to procrastination. If you would like me to send you a reminder email on Friday mornings—work now, lest your weekend be destroyed—I will be happy to do so.

Interpreters, all over again

We rushed through interpreters last time, so let's slow down for one more go-round. We worked through some evaluations on the board, step by step, to see how the environment in a closure holds on to variable bindings, rather than substituting function arguments directly.

Then we extended our interpreter with pairs, as shown below.

import qualified Data.Map as Map
import Data.Map (Map, (!))
import Data.Char

type Id = String

data LCExpr = 
    LCVar Id
  | LCApp LCExpr LCExpr
  | LCLam Id LCExpr
  | LCPair LCExpr LCExpr
  | LCFst LCExpr
  | LCSnd LCExpr
  deriving (Show,Eq)

data LCValue = 
    Closure Id LCExpr Env 
  | Pair LCValue LCValue
  deriving (Eq,Show)

type Env = Map Id LCValue

extend :: Env -> Id -> LCValue -> Env
extend env x v = Map.insert x v env
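Since Map.insert replaces any existing binding for a key, extending the environment with a name that is already bound shadows the old binding, which is exactly what we want for nested lambdas over the same variable. Here is a small sanity check of my own (the names outer and inner are mine, not something we wrote in lecture):

```haskell
-- Binding "x" twice: the newer closure shadows the older one,
-- so looking "x" up in `shadowed` yields `inner`, not `outer`.
shadowed :: Env
shadowed = extend (extend Map.empty "x" outer) "x" inner
  where outer = Closure "a" (LCVar "a") Map.empty
        inner = Closure "b" (LCVar "b") Map.empty
```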

evalLC :: Env -> LCExpr -> LCValue
evalLC env (LCVar x) = env ! x
evalLC env (LCLam x e) = Closure x e env
evalLC env (LCApp e1 e2) = 
  case evalLC env e1 of
    Closure x e env' -> evalLC (extend env' x (evalLC env e2)) e
    _ -> error "Tried to apply a non-function"
evalLC env (LCPair e1 e2) =
  let v1 = evalLC env e1 in
  let v2 = evalLC env e2 in
  Pair v1 v2
evalLC env (LCFst e) =
  case evalLC env e of
    Pair v1 _ -> v1
    _ -> error "Tried to take fst of a non-pair"
evalLC env (LCSnd e) =
  case evalLC env e of
    Pair _ v2 -> v2
    _ -> error "Tried to take snd of a non-pair"
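To see a closure's environment in action, we can apply λx. λy. x to two distinguishable identity functions. The names kExpr, idA, and idB below are my own sanity check, not something we wrote in lecture:

```haskell
kExpr, idA, idB :: LCExpr
kExpr = LCLam "x" (LCLam "y" (LCVar "x"))   -- λx. λy. x
idA   = LCLam "a" (LCVar "a")               -- λa. a
idB   = LCLam "b" (LCVar "b")               -- λb. b

-- Applying kExpr to idA yields a closure for λy. x whose environment
-- maps "x" to idA's closure; applying that to idB then just looks
-- "x" up in the saved environment.
kResult :: LCValue
kResult = evalLC Map.empty (LCApp (LCApp kExpr idA) idB)
-- kResult == Closure "a" (LCVar "a") Map.empty
```

No substitution ever happens: idA's closure comes back out only because the intermediate closure held on to the binding for x.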

Lexing and parsing

How do we take a string to an AST? The first step of any interpreter or compiler is parsing, where a sequence of characters is transformed into a syntax tree.

Parsing is typically broken into two steps: lexing and parsing proper.

Concrete syntax, the program as represented on disk or in memory as a string (i.e., a list of characters), is translated into a stream of tokens. This process is called lexing, because it breaks the string into lexical tokens.

For example, consider our original arithmetic expressions.

data ArithExp = 
    Num Int
  | Plus ArithExp ArithExp
  | Times ArithExp ArithExp
  | Neg ArithExp
  deriving (Eq,Show)
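For instance, the string 2 + 10 * 3 should eventually become the following tree, with * binding tighter than + (the name exampleAST is mine, for illustration):

```haskell
exampleAST :: ArithExp
exampleAST = Plus (Num 2) (Times (Num 10) (Num 3))
```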

Consider the concrete string 2 + 10 * 3. The relevant tokens are 2, +, 10, *, and 3. Note that I left whitespace out: I expect 2 + 10 * 3 to behave the same as 2+10*3. Here’s a data definition for the relevant kind of tokens:

data Token =
    TNum Int
  | TPlus
  | TMinus
  | TTimes
  | TLParen
  | TRParen
  deriving (Show, Eq)

Concretely, the string "2 + 10 * 3" should produce the token list [TNum 2, TPlus, TNum 10, TTimes, TNum 3]. We looked at a lexer that does this translation. It recurs over its input, identifying which token comes next. Note how it's careful to dispose of whitespace first, then to check for the appropriate symbols, then to convert runs of digits to numbers, and finally to give up.

lexer :: String -> [Token]
lexer [] = []
lexer (w:s) | isSpace w = lexer (dropWhile isSpace s)
lexer ('+':s) = TPlus:lexer s
lexer ('-':s) = TMinus:lexer s
lexer ('*':s) = TTimes:lexer s
lexer ('(':s) = TLParen:lexer s
lexer (')':s) = TRParen:lexer s
lexer s@(c:_) | isDigit c =
  let (n,s') = span isDigit s in
  TNum (read n :: Int):lexer s'
lexer (n:_) = error $ "Lexer error: unexpected character " ++ [n]
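As a quick sanity check (mine, not from lecture), running the lexer on our example string gives exactly the token list we wanted, with or without the whitespace:

```haskell
lexed :: [Token]
lexed = lexer "2 + 10 * 3"
-- lexed == [TNum 2, TPlus, TNum 10, TTimes, TNum 3]
--        == lexer "2+10*3"
```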