# Homework 4

## Parsing

This homework is written in literate Haskell; you can download the raw source to fill in yourself. You’re welcome to submit literate Haskell yourself, or to start fresh in a new file, literate or not.

Please submit homeworks via the new submission page.

``````module Hw04 where

import Control.Applicative
import Data.Char

import qualified Data.Map as Map
import Data.Map (Map)``````

In this homework, you’ll write some pretty-printers and parsers for the While language of HW03.

``````type VarName = String

data AExp =
Var VarName
| Num Int
| Plus AExp AExp
| Times AExp AExp
| Neg AExp
| Div AExp AExp
deriving Eq

data BExp =
Bool Bool
| Equal AExp AExp
| Lt AExp AExp
| Not BExp
| Or BExp BExp
| And BExp BExp
deriving (Show, Eq)

data Stmt =
Skip
| Assign VarName AExp
| Seq Stmt Stmt
| If BExp Stmt Stmt
| While BExp Stmt
deriving (Show, Eq)``````

### Problem 1: Pretty printing with `Show`

Write a `Show` instance for `AExp` that prints as few parentheses as possible. That is, `show (Plus (Var "x") (Plus (Num 6) (Num 36)))` should return `"x + 6 + 36"`, not `"x + (6 + 36)"`. Similarly, `show (Plus (Num 5) (Times (Num 6) (Num 36)))` should return `5 + 6 * 36`, because multiplication has a higher precedence than addition. But `show (Times (Num 5) (Plus (Num 6) (Num 36)))` should return `"5 * (6 + 36)"`. Finally, you should encode syntactic sugar in the pretty printing: if you encounter a node like `Plus (Num 5) (Neg (Var x))`, please render it as `"5 - x"`. Take care that subtraction and division are both left associative.

The trick is to have the pretty printer follow the structure of the parser. There are two approaches: either take an extra integer parameter and use it to determine what “level” you’re at, or actually write several functions (`showTerm`, `showFactor`, etc.).

``````instance Show AExp where
show = undefined``````

### Problem 2: Parsing expressions by hand

Traditionally, parsing is broken into two phases: lexing, which converts a string to a list of tokesn; and parsing, which converts a list of tokens to an abstract syntax tree (AST). In this problem, you’ll extend a lexer and parser written for arithmetic expressions to one that lexes and parses a version of the While language above.

To start with, we’ll define some tokens, the core “meaningful” bits of our language. Consider the concrete string `2 + 10 * 3`. The relevant tokens are `2`, `+`, `10`, `*`, and `3`. Note that I left whitespace out: I expect `2 + 10 * 3` to behave the same as `2+10*3`—most languages ignore whitespace (but just wait until Problem 4!). Here’s a data definition for the relevant kind of tokens:

``````data Token =
TNum Int
| TId String
| TPlus
| TMinus
| TTimes
| TDiv
| TLParen
| TRParen``````
``  deriving (Show, Eq)``
``````lexer :: String -> [Token]
lexer [] = []
lexer (w:s) | isSpace w = lexer (dropWhile isSpace s)
lexer ('+':s) = TPlus:lexer s
lexer ('-':s) = TMinus:lexer s
lexer ('*':s) = TTimes:lexer s
lexer ('/':s) = TDiv:lexer s
lexer ('(':s) = TLParen:lexer s
lexer (')':s) = TRParen:lexer s``````
``````lexer s | isAlpha (head s) =
let (id,s') = span isAlphaNum s in
TId id:lexer s'
lexer s | isDigit (head s) =
let (n,s') = span isDigit s in
TNum (read n :: Int):lexer s'
lexer (n:_) = error \$ "Lexer error: unexpected character " ++ [n]``````

Test this lexer out, and convince yourself it correctly translates a string into its constituent tokens. Concretely, the string `"2 + 10 * 3"` should produce the token list `[TNum 2, TPlus, TNum 10, TTimes, TNum 3]`. The lexer recurs over its input, identifying which token is at the front each time. Note how it’s careful to dispose of whitespace first, then check for appropriate symbols, then convert to numbers, and finally to give up.

Parsers operate on streams of tokens, trying to produce an AST; each parser function also returns the remaining stream of tokens. Take particular care to note the loops in play: `parseATerm`: calls `parseATerm'`, which is mutually recursive with `parseATerm''`; there’s a similar setup with `parseAFactor`, `parseAFactor'`, and `parseAFactor''`.

``````parseATerm :: [Token] -> Either String (AExp,[Token])
parseATerm ts =
case parseAFactor ts of
Right (f,ts) -> parseATerm' f ts
Left e -> Left e

parseATerm' :: AExp -> [Token] -> Either String (AExp,[Token])
parseATerm' lhs [] = Right (lhs, [])
parseATerm' lhs (TPlus:ts) = parseATerm'' (Plus lhs) ts
parseATerm' lhs (TMinus:ts) = parseATerm'' (Plus lhs . Neg) ts
parseATerm' lhs ts = Right (lhs,ts)``````

Notice how we just encoded subtraction above! It’s just like what we did in HW02 with the `translate` function, but we’ve built it into our parser. Adding niceties like this is called syntactic sugar.

``````parseATerm'' :: (AExp -> AExp) -> [Token] -> Either String (AExp,[Token])
parseATerm'' mk [] = Left \$ "expected term after +/-"
parseATerm'' mk ts =
case parseAFactor ts of
Right (e,ts) -> parseATerm' (mk e) ts
Left e -> Left e

parseAFactor :: [Token] -> Either String (AExp,[Token])
parseAFactor ts =
case parseANeg ts of
Right (lhs,ts) -> parseAFactor' lhs ts
Left e -> Left e

parseAFactor' :: AExp -> [Token] -> Either String (AExp,[Token])
parseAFactor' lhs (TTimes:ts) = parseAFactor'' (Times lhs) ts
parseAFactor' lhs (TDiv:ts) = parseAFactor'' (Div lhs) ts
parseAFactor' lhs ts = Right (lhs, ts)

parseAFactor'' :: (AExp -> AExp) -> [Token] -> Either String (AExp,[Token])
parseAFactor'' mk [] = Left \$ "expected term after *"
parseAFactor'' mk ts =
case parseANeg ts of
Right (e,ts) -> parseAFactor' (mk e) ts
Left e -> Left e

parseANeg :: [Token] -> Either String (AExp, [Token])
parseANeg (TMinus:ts) =
case parseAAtom ts of
Right (e,ts') -> Right (Neg e, ts')
Left e -> Left e
parseANeg ts = parseAAtom ts

parseAAtom :: [Token] -> Either String (AExp, [Token])
parseAAtom (TNum n:ts) = Right (Num n, ts)
parseAAtom (TId id:ts) = Right (Var id, ts)
parseAAtom (TLParen:ts) =
case parseATerm ts of
Right (e,TRParen:ts') -> Right (e,ts')
Right (_,ts) -> Left \$ "expected right paren, found: " ++ show ts
Left e -> Left e
parseAAtom ts = Left \$ "expected number, identifier, or parens, found: " ++ show ts``````

To test this parser, play with the following “composed” function that combines the lexer and parser into one:

``````tryParse :: ([Token] -> Either String (a,[Token])) -> String -> a
tryParse parser s =
case parser \$ lexer s of
Right (e,[]) -> e
Right (_,ts) -> error \$ "Parse error: expected EOF, found: " ++ show ts
Left e -> error \$ "Parser error: " ++ e

tryParseATerm :: String -> AExp
tryParseATerm = tryParse parseATerm``````

Your job is to finish writing a parser for the While language in an old school Pascal-style syntax. Here are some example programs:

``````pascalProg1 = "x := 5; y := 6"
pascalProg2 = "IF x = 0 THEN y := 10 ELSE y := x END"
pascalProg3 = "IF NOT (x = 0) THEN y := 10 ELSE y := x END"
pascalProg4 = "IF x != 0 THEN y := 10; SKIP ELSE y := x END"
pascalProg5 = " \t\nWHILE IFFY > 10 AND (true OR false) DO IFFY := IFFY - 1 END;\n\nderp := 7"
pascalProg6 = "WHILE true DO SKIP END"
pascalProg7 = "IF x>=0 THEN y := 10 ELSE SKIP END;\ny := x + y;\nx := 5"
pascalProg8 = "IF x=46 THEN x := x + 1 END; oneArm := x * 2"``````

Note that some of these syntactic forms aren’t directly supported in the AST (e.g., there’s no way to directly say `>=` in `BExp`); you’ll have to encode these ideas using syntactic sugar. Also, be careful: `IF` is a keyword, but `IFFY` is a variable name.

These syntactic forms should cover all of the corner cases, but I might have missed one. On Monday, February 27th, please bring a couple of example programs to class so we can discuss them.

You’ll need to:

1. Extend the `Token` type to account for other tokens (e.g., `WHILE`).
2. Change the lexer to find these new tokens. (Watch out for keyword/variable collisions!)
3. Write a parser for boolean expressions. If you’re wondering about precedence, think of `OR` like `+` and `AND` like `*`. For relations like `=` and `!=` and `>=`, they should be parsed under OR, e.g., `"x = y OR x < y AND y = z"` should parse as `Or (Equal x y) (And (Lt x y) (Equal y z))`
4. Write a parser for commands/statements.

You should largely be able to follow the pattern for arithmetic expressions established already, but you’ll have to be creative in a few spots.

Parsers are hard to write by hand. The surest way to avoid bugs is to write clean code.

I will only test your last step, the `pascalSyntax` function. I’ve given you a bit of a start here: I suggest you split statement parsing in two parts: one part for parsing a sequence of statements and one part for parsing an individual stement.

``````parseStmts :: [Token] -> Either String (Stmt,[Token])
parseStmts ts = undefined

pascalSyntax :: String -> Stmt
pascalSyntax = tryParse parseStmts``````

Do not change `pascalSyntax`’s type or general behavior—whatever you do, your parser should follow the general scheme of the `tryParse` functions we’ve seen.

### Problem 3: Parsing expressions using Applicative parser combinators

Problem 3 is long enough that I’ve marked the things you should do with HWHWHW. I haven’t done so for the other problems, which have fewer parts.

A `Parser a` is a function that takes a `String` and may return an `a` and a `String`. Intuitively, a Parser either (1) fails to parse, or (2) parses part of the input (at type `a`) and has some remaining, unparsed input.

``newtype Parser a = Parser { parse :: String -> Maybe (a,String) }``

We’ll start out by defining a few necessary instances, but our goal isn’t to write `Parser (\s -> ...)` by hand: we’ll construct a library of combinators where we can build up parsers nicely.

But first, `Parser` is a functor: `fmap` applies to the parsed value.

``````instance Functor Parser where
fmap f p = Parser \$ \s -> (\(a,c) -> (f a, c)) <\$> parse p s``````

`Parser` is also applicative: a pure `Parser` always succeeds, just returning its input. The `(<*>)` (read: “ap”) function takes an `f :: Parser (a -> b)` and a `Parser a` and combines them: it first parses `f` to get a function `g :: a -> b` and some remaining bit of string to be parsed, `s'`. It then parses `s'` with `a :: Parser a`, making sure to apply `g` to the result. (Note how we use `fmap` to concisely say “apply `g` to the result of parsing with `a`.)

``````instance Applicative Parser where
pure a = Parser \$ \s -> Just (a,s)
f <*> a = Parser \$ \s ->
case parse f s of
Just (g,s') -> parse (fmap g a) s'
Nothing -> Nothing``````

There’s one last type class to define: `Alternative`, which is a left-biased choice. In the `empty` case, we just return `Nothing`. When considering two alternatives, we can take advantage of the existing `Alternative` class for `Maybe`: try to parse the left-hand side, and if that fails, try the right.

``````instance Alternative Parser where
empty = Parser \$ \s -> Nothing
l <|> r = Parser \$ \s -> parse l s <|> parse r s``````

Below we’ll define a few “primitive” parsers, which explicitly write parsing functions. Don’t define any extra ones! All of your solutions should use (a) these primitive parsers, (b) Functor/Applicative/Alternative, and (c) other helpers/recursive functions you define.

``````ensure :: (a -> Bool) -> Parser a -> Parser a
ensure p parser = Parser \$ \s ->
case parse parser s of
Nothing -> Nothing
Just (a,s') -> if p a then Just (a,s') else Nothing

where f [] = Just (Nothing,[])
f (c:s) = Just (Just c,c:s)

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser f
where f [] = Nothing
f (x:xs) = if p x then Just (x,xs) else Nothing``````

We can detect the end of the file manually…

``````eof :: Parser ()
eof = Parser \$ \s -> if null s then Just ((),[]) else Nothing``````

…but the other three primitive parsers are enough on their own.

HWHWHW Write a function `eof'` which behaves like `eof` but is built up from the other primitive parsers:

``````eof' :: Parser ()
eof' = undefined``````

In general, we’ll be careful to allow arbitrary whitespace at the beginnings of what we parse.

``````ws :: Parser ()
ws = pure () <* many (satisfy isSpace)

char :: Char -> Parser Char
char c = ws *> satisfy (==c)

str :: String -> Parser String
str s = ws *> loop s
where loop [] = pure []
loop (c:cs) = (:) <\$> satisfy (==c) <*> loop cs``````

We can define what it means to parse something in parentheses. Note how we’re careful to specify order of operations between `(*>)` and `(<*)`—forgetting to do this can lead to some tricky bugs!

``````parens :: Parser a -> Parser a
parens p = (char '(' *> p) <* char ')'``````

Let’s start our parser from the bottom up. We’ll begin with a notion of keyword a/k/a a reserved word. These are English words that can’t be used as variables because they’ll be part of the concrete syntax of our language.

``````keywords :: [String]
keywords = ["SKIP","IF","THEN","ELSE","END","WHILE","DO","true","false"]

isKeyword = (`elem` keywords)``````

To parse a keyword, we’ll parse it as a string (using `str`) but make sure that it’s the whole identifier. For example, `parse (kw "repeat") "repeat "` should return `Just ("repeat"," ")`, but `parse (kw "repeat") "repeatable"` should return `Nothing`.

HWHWHW Write the that parses a keyword (you don’t need to check that the given string is in the list `keywords`).

``````kw :: String -> Parser String
kw s = undefined``````

HWHWHW Now let’s parse variables. A variable is (1) a string starting with an alphabetical character followed by zero or more alphanumeric characters that (2) isn’t a keyword.

``````var :: Parser String
var = undefined``````

A number is one or more digits. We can use `read :: Read a => String -> a` to read out an Int, since a `Read Int` instance is defined in the Prelude. Note that the type annotation here is critical, or Haskell will complain—it doesn’t know what you want to read!

``````num :: Parser Int
num = ws *> (read <\$> some (satisfy isDigit))``````

HWHWHW Now define an `AExp` parser; call the top-level “terms” `aexp`, but you can call the rest whatever you want. Feel free to copy/paste most of it from lecture notes, but don’t forget to add division! Please also encode subtraction using addition and negation.

``````aexp :: Parser AExp
aexp = undefined``````

HWHWHW Now let’s define a `BExp` parser; we’ll have to use the `aexp` parser. If you’re unsure of which precedences to use here, recall that in Boolean algebra, conjunction is multiplication and disjunction is addition.

Your `bexp` parser should use the existing comparisons on `AExp`s to encode all of the relevant comparisons. Here are some examples and what they parse to; you don’t have to parse exactly to what I do, but you should parse to something equivalent. Use C-style syntax: and is `&&`, or is `||`, and not is `!`. The constants can just use lower-case names, like `true` and `false`.

``````bexp_encoding_test1 = "x == y" -- Equal (Var x) (Var y)
bexp_encoding_test2 = "x <= y" -- Or (Equal (Var x) (Var y)) (Lt (Var x) (Var y))
bexp_encoding_test3 = "x != y" -- Not (Equal (Var x)) (Var y)

bexp :: Parser BExp
bexp = undefined``````

Please use `aexp` and `bexp` to parse expressions in the parsers you write below.

### Problem 4: Parsing a la Pascal

Now we can write a parser for statements. Let’s use the same Pascal-like syntax we used above.

Your top-level parser should be called `pascalSyntax'`. It’s helpful to think of a top-level program as being a sequence of one more statements (in the concrete syntax, separated by `;`; in the abstract syntax tree, joined by `Seq`). Don’t forget to make sure that your parser consumes all of its input when possible. (Hint: think about `eof`.)

Just as above, you’ll definitely want to write more test programs. If you come up with interesting corner cases, post them to Piazza! I want to explicitly encourage collaboration on understanding the syntax (though please write your own parsers within your pair).

``````pascalSyntax' :: Parser Stmt
pascalSyntax' = undefined``````

Note: you will not be penalized for extra `Skip`s in your parsed output. If inserting a `Skip` here or there makes your life easier, go for it.

### Optional Challenge Problem 5: Parsing a la Python

YOU DON’T NEED TO DO THIS PROBLEM AND I WON’T GRADE IT, but it’s excellent practice. If you can do it, you have parsers mastered.

Python’s syntax is pleasantly spare, a nice example of whitespace-sensitive syntax. (You may notice whitespace is meaningful in Haskell’s syntax, as well!) The idea is simple: developers already use indentation to indicate, e.g., how nested you are in loops. Why not make such indentation a requirement and reduce the clutter from curly brackets?

Robust languages can infer the indentation level used on a per-program or per-block basis, but let’s fix the indentation level at two: every time we open an if or while block, we’ll indent the blocks by two more spaces than we were.

Here’s a sample program:

``pyProg1 = "x = 5\ny = 6"``

Oy—those newlines are going to get tiresome. There’s a handy function, `unlines`, which will let us write this a clearer way:

``````pyProg1' = unlines ["x = 5",
"y = 6"]

pyProg2 = unlines ["if x == 0:",
"  y = 10",
"else:",
"  y = x"]
pyProg3 = unlines ["while iffy > 10: ",
"  iffy = iffy - 1"]
pyProg4 = unlines ["while false:",
"  ",
"x = 0"]
pyProg5 = unlines ["while x > 0:",
"  while y < x:",
"    x = x - 1",
"    y = y + 1",
"  x = x - 1",
""]``````

I’ve tried to cover the corner cases with these examples, but please do discuss them. I’ll try to settle any ambiguities that may arise.

Please write another parser for our language, but in the style of Python. Your top-level parser of statements should be named `pythonSyntax`:

``````pythonSyntax :: Parser Stmt
pythonSyntax = undefined``````

You can largely follow the same idea as your Pascal-style parser, but your parsers for individual statements will need to keep track of the current indentation level, e.g., by taking an integer parameter. You’ll need to be much more careful about whitespace in general and newlines in particular. Writing a whitespace-sensitive parser is hard. Let “testing” be your watchword!