Lecture 10 — 2015-10-05

Parsing

This lecture is written in literate Haskell; you can download the raw source.

I opened class by mentioning the possibility of getting a scholarship to attend PLMW at POPL, a mentoring workshop at one of the top PL conferences. Email me if you’re interested.

I also mentioned my upcoming elective, 181N Software Foundations, where we’ll we using the Coq theorem prover to build verified software. It should be fun!

We spent the rest of class looking at lexers and parsers.

Lexing

We recapped the arithmetic lexer from last time.

import Data.Char

data Token =
    TNum Int
  | TPlus
  | TMinus
  | TTimes
  | TLParen
  | TRParen
  deriving (Show, Eq)

lexer :: String -> [Token]
lexer [] = []
lexer (w:s) | isSpace w = lexer (dropWhile isSpace s)
lexer ('+':s) = TPlus:lexer s
lexer ('-':s) = TMinus:lexer s
lexer ('*':s) = TTimes:lexer s
lexer ('(':s) = TLParen:lexer s
lexer (')':s) = TRParen:lexer s
lexer s | isDigit (head s) =
  let (n,s') = span isDigit s in
  TNum (read n :: Int):lexer s'
lexer (n:_) = error $ "Lexer error: unexpected character " ++ [n]

Parsing

We also covered parsing in general, doing an overview of the parser for arithmetic expressions.

data ArithExp = 
    Num Int
  | Plus ArithExp ArithExp
  | Times ArithExp ArithExp
  | Neg ArithExp
  deriving (Eq,Show)

parse :: String -> ArithExp
parse s = 
  case parseTerm $ lexer s of
    Right (e,[]) -> e
    Right (_,ts) -> error $ "Parse error: expected EOF, found: " ++ show ts
    Left e -> error $ "Parser error: " ++ e

I took particular care to note the loops in play: parseTerm calls parseTerm', which is mutually recursive with parseTerm''; there’s a similar setup with parseFactor, parseFactor', and parseFactor''.

parseTerm ts = 
  case parseFactor ts of
    Right (f,ts) -> parseTerm' f ts
    Left e -> Left e
      
parseTerm' lhs [] = Right (lhs, [])
parseTerm' lhs (TPlus:ts) = parseTerm'' (Plus lhs) ts
parseTerm' lhs (TMinus:ts) = parseTerm'' (Plus lhs . Neg) ts
parseTerm' lhs ts = parseFactor' lhs ts

parseTerm'' mk [] = Left $ "expected term after +/-"
parseTerm'' mk ts = 
  case parseFactor ts of
    Right (e,ts) -> parseTerm' (mk e) ts
    Left e -> Left e

parseFactor ts =
  case parseNum ts of
    Right (lhs,ts) -> parseFactor' lhs ts
    Left e -> Left e
    
parseFactor' lhs (TTimes:ts) = parseFactor'' (Times lhs) ts
parseFactor' lhs ts = Right (lhs, ts)

parseFactor'' mk [] = Left $ "expected term after *"
parseFactor'' mk ts = 
  case parseNum ts of
    Right (e,ts) -> parseFactor' (mk e) ts
    Left e -> Left e

parseNum (TNum n:ts) = Right (Num n, ts)
parseNum (TLParen:ts) =
  case parseTerm ts of
    Right (e,TRParen:ts') -> Right (e,ts')
    Right (_,ts) -> Left $ "expected right paren, found: " ++ show ts
    Left e -> Left e
parseNum ts = Left $ "expected numbers or parens, found: " ++ show ts

Using alex

We also went over generated lexers and parsers. Suppose the following is the contents of a file Lexer.x:

{
module Lexer where
}

%wrapper "basic"

$digit = 0-9               -- digits
$alpha = [a-zA-Z]          -- alphabetic characters

tokens :-

  $white+                         ;
  \(                              { const TLParen }
  \)                              { const TRParen }
  \\                              { const TBackslash }
  \.                              { const TDot }
  $alpha [$alpha $digit \_ \']*   { \s -> TId s }

{
-- Each action above has type :: String -> LCToken

-- The token type:
data LCToken =
    TBackslash 
  | TLParen
  | TRParen
  | TDot
  | TId String
  deriving (Eq,Show)
}

Running alex on a file like the above will generate a new file, Lexer.hs, which has a function alexScanTokens :: String -> [Token].

In terms of the general file structure:

  • Things between { and } are Haskell code.
  • The $digit = [0-9] sets up an alex variable $digit that holds a regular expression matching decimal digits.
  • The code following tokens :- sets up a mapping from regular expressions to tokens.
  • The LCToken datatype establishes the set of tokens.

Just for contrast, here’s a manually written lexer that parses the same tokens:

data LCToken =
    LCBackslash 
  | LCLParen
  | LCRParen
  | LCDot
  | LCId String
  deriving (Eq,Show)

lexer' :: String -> [LCToken]
lexer' [] = []
lexer' (w:s) | isSpace w = lexer' (dropWhile isSpace s)
lexer' ('\\':s) = LCBackslash:lexer' s
lexer' ('.':s) = LCDot:lexer' s
lexer' ('(':s) = LCLParen:lexer' s
lexer' (')':s) = LCRParen:lexer' s
lexer' s | isAlpha (head s) =
  let (id,s') = span isAlpha s in
  LCId id:lexer' s'
lexer' (n:_) = error $ "Lexer error: unexpected character " ++ [n]

Using happy

We saw two happy grammars: one for arithmetic and one for the lambda calculus. Suppose we have the following file, Parser.y.

{  
module Parser where

import Data.Char
}

%name parseTokens
%tokentype { Token }
%error { parseError }

%token
  num  { TNum $$ }
  "+"  { TPlus }
  "*"  { TTimes }
  "-"  { TMinus }  
  "("  { TLParen }
  ")"  { TRParen }

%%

term :
    term "+" factor { Plus $1 $3 }
  | term "-" factor { Plus $1 (Neg $3) }
  | factor          { $1 }
    
factor :
    factor "*" neg { Times $1 $3 }
  | neg            { $1 }

neg :
    "-" atom { Neg $2 }
  | atom     { $1 }

atom :
    num          { Num $1 }
  | "(" term ")" { $2 }
    
{
  
data ArithExp = 
    Num Int
  | Plus ArithExp ArithExp
  | Times ArithExp ArithExp
  | Neg ArithExp
  deriving (Eq,Show)

parse :: String -> ArithExp           
parse = parseTokens . lexer

parseError tokens = error $ "Parser error: " ++ show tokens

}

Running happy Parser.y will generate Parser.hs with the function parseTokens :: [Token] -> ArithExp. A few notes:

  • Again, those parts between { and } are Haskell syntax. The rest is happy syntax.
  • The part after %token sets up the terminals we’ll use in the grammar. The $$ variable identifies which part of a token is relevant for positional variables. These are patterns, where $$ identifies the part you’d like bound to the corresponding positional variable.
  • The part after %% defines the grammar’s nonterminals, of the form name : list of terminals and nonterminals { code for an AST } | ...
  • The code for generating ASTs in each production uses positional variables to refer to tokens (and their values, as established by $$).
  • The parserError function is necessary for happy to work.

The driver

Finally, I mentioned the Main.hs and Makefile I distributed with the code. These put let you test your code—use them!