The next step to writing a compiler is creating the parser. For the parser generator I’m using Happy which is the Haskell version of yacc.
To make following along with me easier I’ve uploaded the parser and other files for NewL to github.com at http://github.com/bjwbell/NewL-Compiler. The parser is in the parser directory. Please note that the parser contains no error checking. I’ll explain how to add error checking in a later blog post.
To play with the parser and test it go to http://github.com/bjwbell/NewL-Compiler/archives/master. It’ll ask you to download the source. Once you’ve downloaded the source and unzipped it, open a terminal and go to the parser directory.
To compile the parser you’ll need to have the Glasgow Haskell Compiler collection installed. You also need to have Happy and Alex installed. If you’re running Ubuntu all of them are available via the Ubuntu repositories. To compile the parser run
make
and to test the parser run
cat test1.newl | ./newl
It will output the parse tree representation of the test1.newl program.
Before going into how I wrote the parser I want to mention that to get the scanner and parser to play together you have to delete
main = do
s <- getContents
print (alexScanTokens s)
from scanner.x. Otherwise Alex would generate a standalone scanner.
For an explanation of how the parts to the parser fit together I strongly encourage you to read http://www.haskell.org/happy/doc/html/sec-using.html. It has a very well written explanation of writing a parser using Happy.
The first part of the parser is to declare the name and what modules we import.
{
module Main where
import Scanner
}
The important part is “import Scanner”. This imports the scanner that was declared in scanner.x. In particular notice that the name “Scanner” matches the module name declared in scanner.x. The second part is to declare the name of our parser, you can choose any name you like. We also declare the token type which must match the token type declared in scanner.x, finally we give the name of the error function.
%name newl
%tokentype { Token }
%error { parseError }
Now onto to the meat of the parser. We declare a list of tokens as I’ve done below.
%token
"class" { TClass }
"new" { TNew }
"String" { TString }
"static" { TStatic }
"void" { TVoid }
"main" { TMain }
"public" { TPublic }
"return" { TReturn }
"extends" { TExtend }
"int" { TInt }
"boolean" { TBool }
"if" { TIf }
"else" { TElse }
"true" { TTrue }
"false" { TFalse }
"this" { TThis }
"length" { TLength }
"while" { TWhile }
integer_literal { TIntLiteral $$ }
ident { TIdent $$ }
"{" { TLeftBrace }
"}" { TRightBrace }
"," { TComma }
"[" { TLeftBrack }
"]" { TRightBrack }
op { TOp $$}
comop { TComOp $$ }
"(" { TLeftParen }
")" { TRightParen }
";" { TSemiColon }
"." { TPeriod }
"!" { TNot }
"=" { TEquals }
"System.out.println" { TPrint }
%%
“The symbols on the left are the tokens as they will be referred to in the rest of the grammar, and to the right of each token enclosed in braces is a Haskell pattern that matches the token. The parser will expect to receive a stream of tokens, each of which will match one of the given patterns .” [1] (the definition of the Token datatype is given in scanner.x)
Now we declare the production rules to the grammar (to view the grammar in BNF notation see http://bloggingmath.wordpress.com/2010/01/20/writing-a-compiler-in-haskell-compiler-series-part-i/).
Program :
MainClass ClassDeclList { Program $1 $2 }
MainClass :
"class" ident "{" "public" "static" "void" "main" "(" "String" "[" "]" ident ")" "{" Statement "}" "}" { MClass $2 $12 $15 }
ClassDeclList :
ClassDecl { ClassDeclList $1 CEmpty }
| ClassDecl ClassDeclList { ClassDeclList $1 $2 }
| { CEmpty }
ClassDecl :
"class" ident "{" VarDeclList MethodDeclList "}" { ClassDecl $2 "void" $4 $5 }
| "class" ident "extends" ident "{" VarDeclList MethodDeclList "}" { ClassDecl $2 $4 $6 $7 }
MethodDeclList :
MethodDecl { MethodDeclList $1 MEmpty }
| MethodDecl MethodDeclList { MethodDeclList $1 $2 }
| { MEmpty }
MethodDecl :
"public" Type ident "(" FormalList ")" "{" VarDeclList StatementList "return" Exp ";" "}" { MethodDecl $2 $3 $5 $8 $9 $11 }
VarDeclList :
Type ident ";" { VarDeclList $1 $2 VEmpty }
| Type ident ";" VarDeclList { VarDeclList $1 $2 $4 }
| { VEmpty }
FormalList :
Type ident { FormalList $1 $2 FEmpty }
| Type ident FormalList { FormalList $1 $2 $3 }
Type :
"int" "[" "]" { TypeIntArray }
| "boolean" { TypeBoolean }
| "int" { TypeInt }
| ident { TypeIdent $1 }
Statement :
"{" StatementList "}" { SList $2 }
| "if" "(" Exp ")" Statement "else" Statement { SIfElse $3 $5 $7 }
| "while" "(" Exp ")" Statement { SWhile $3 $5 }
| "System.out.println" "(" Exp ")" ";" { SPrint $3 }
| ident "=" Exp ";" { SEqual $1 $3 }
| ident "[" Exp "]" "=" Exp ";" { SArrayEqual $1 $3 $6 }
StatementList :
Statement { StatementList Empty $1 }
| StatementList Statement { StatementList $1 $2 }
Exp :
Exp op Exp { ExpOp $1 $2 $3}
| Exp comop Exp { ExpComOp $1 $2 $3}
| Exp "[" Exp "]" { ExpArray $1 $3}
| Exp "." "length" { ExpLength $1}
| Exp "." ident "(" ExpList ")" { ExpFCall $1 $3 $5}
| integer_literal { ExpInt $1}
| "true" { ExpBool True}
| "false" { ExpBool False}
| ident { ExpIdent $1}
| "this" { ExpThis }
| "new" "int" "[" Exp "]" { ExpNewInt $4 }
| "new" ident "(" ")" { ExpNewIdent $2}
| "!" Exp { ExpNot $2}
| "(" Exp ")" { ExpExp $2}
ExpList :
Exp { ExpListExp $1 }
| Exp ExpRest { ExpList $1 $2 }
| { ExpListEmpty }
ExpRest :
"," Exp { ExpRest $2 }
“Each production consists of a non-terminal symbol on the left, followed by a colon, followed by one or more expansions on the right, separated by |. Each expansion has some Haskell code associated with it, enclosed in braces as usual.” [1]
“The way to think about a parser is with each symbol having a `value’: we defined the values of the tokens above, and the grammar defines the values of non-terminal symbols in terms of sequences of other symbols (either tokens or non-terminals).” [1]
With production rules declared we now declare the data types for them. Which I’ve listed below.
{
parseError :: [Token] -> a
parseError _ = error "Parse error"
data Program
= Program MainClass ClassDeclList
deriving (Show, Eq)
data MainClass
= MClass String String Statement
deriving (Show, Eq)
data ClassDeclList
= ClassDeclList ClassDecl ClassDeclList
| CEmpty
deriving (Show, Eq)
data ClassDecl = ClassDecl Ident Ident VarDeclList MethodDeclList
deriving (Show, Eq)
data MethodDeclList
= MethodDeclList MethodDecl MethodDeclList
| MEmpty
deriving (Show, Eq)
data MethodDecl
= MethodDecl Type Ident FormalList VarDeclList StatementList Exp
deriving (Show, Eq)
data VarDeclList =
VarDeclList Type Ident VarDeclList
| VEmpty
deriving (Show, Eq)
data FormalList =
FormalList Type Ident FormalList
| FEmpty
deriving (Show, Eq)
data Type =
TypeIntArray
| TypeBoolean
| TypeInt
| TypeIdent Ident
deriving (Show, Eq)
data Statement
= Statement String
| SList StatementList
| SIfElse Exp Statement Statement
| SWhile Exp Statement
| SPrint Exp
| SEqual Ident Exp
| SArrayEqual Ident Exp Exp
| StatementError
deriving (Show, Eq)
data StatementList
= StatementList StatementList Statement
| Empty
deriving (Show, Eq)
data Exp
= Exp String
| ExpOp Exp Char Exp
| ExpComOp Exp Char Exp
| ExpArray Exp Exp -- "Exp [ Exp ]"
| ExpFCall Exp Ident ExpList -- Exp . Ident ( ExpList )
| ExpInt Int
| ExpNewInt Exp
| ExpBool Bool -- True or False
| ExpIdent Ident
| ExpNewIdent Ident -- new Ident ()
| ExpExp Exp -- Exp ( Exp )
| ExpThis
| ExpNot Exp
| ExpLength Exp
| ExpError
deriving (Show, Eq)
data Op
= And
| LessThan
| Plus
| Minus
| Times
deriving (Show, Eq)
type Ident = String
type Integer_Literal = Int
data ExpList
= ExpList Exp ExpRest
| ExpListEmpty
| ExpListExp Exp
deriving (Show, Eq)
data ExpRest
= ExpRest Exp
deriving (Show, Eq)
main = do
inStr <- getContents
let parseTree = newl (alexScanTokens inStr)
putStrLn ("parseTree:" ++ show(parseTree))
print "done"
}
I’ve also defined the “parseError” function and the main function.
The main function uses “getContents” to retrieve the input from stdin and then creates the parse tree using “newl (alexScanTokens inStr)”. Note the “newl” name is the same one we declared using “%name newl” at the beginning of the program.
With the above code we now have a working parser for the NewL language. At this point if there’s an error parsing it barfs out “parse error” and doesn’t tell us anything about the error. Also no type or semantic analysis is done. I’ll expand on both of these points in later posts.
References
[1] Chapter 2. Using Happy, http://www.haskell.org/happy/doc/html/sec-using.html