Haskell & LLVM Talk

I did a talk at the Linux Users’ Group of Davis yesterday evening that went very well. The QA and discussion afterwards was much better than previous talks I’ve given to students, the attendees from the Bay Area Haskell Users’s group were very sharp and had good comments.

Photos and slides are posted here and the slides are also viewable below.

Code Generation With LLVM (Compiler Series Part IX)

With type checking done, it’s time to generate code from the parse tree. Code generation typically consists of the following steps:
1. Generate intermediate code.
2. Generate the machine or byte code from the intermediate code.

The following are options for doing the above two tasks:

  • Generate intermediate and machine code for x86 or RISC using our own methods (tedious and error prone).
  • Generate the intermediate code and then byte code for either the JVM or CLI.
  • Outsource the problem to LLVM.

I’m a lazy fellow so I pick choice #3 of outsourcing the code generation to LLVM. This is a particularly apt choice given the exciting intersection of Haskell with LLVM recently via David Terei’s honors thesis on using LLVM as the backend for the GHC compiler. This work is ongoing at
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/Backends/LLVM
and for an explanation of why LLVM is used as the backend see
http://blog.llvm.org/2010/05/glasgow-haskell-compiler-and-llvm.html
.

Bryan O’Sullivan put together bindings to LLVM at
http://code.haskell.org/llvm/
. Bryan and the other author, Lennart Augustsson, also posted examples of using the bindings at
http://www.serpentine.com/blog/2008/01/03/llvm-bindings-for-haskell/
,
http://augustss.blogspot.com/2009/01/llvm-llvm-low-level-virtual-machine-is.html
, and
http://augustss.blogspot.com/2009/01/llvm-arithmetic-so-we-want-to-compute-x.html
.

I’m reading up on LLVM and will post more as I learn more about LLVM in general and the Haskell LLVM bindings in particular.

Cleaning up NewL and Adding a Symbol Table Part II (Compiler Series Part VII)

Type checking has several stages, I’m breaking it up into a few posts to cover it more fully. To type check a program we need to:

  • 1. Create the symbol table for the classes. This symbol table is an association list where the key is the class name and the value is a “ClassSymbol”.
  • 2. For each class create the symbol table for the class methods which is again an association list where the key is the method name and the value is a “MethodSymbol”
  • 3. For each class create the symbol table of the class instance variables, this symbol table is an association list where the key is the variable name and the value is the variable type name. For example a class with one instance variable named “count” of type “int” has [("count", "int")] as the symbol table.
  • 4. Type check the statements and expressions within the methods. (future post)

In looking over our language NewL I realized that arrays and strings are not handled consistently. In particular a string was declared with a capital “S” in the “String” versus using all lowercase for “int” and “boolean”. I didn’t like this since it make the declaration syntax inconsistent for inbuilt types. The second issue is that it is only possible to declare arrays of integers. I want to make arrays an integral feature of the language. Arrays should be of any type. With those two changes in mind I’ve updated the grammar of NewL. The updated grammar is shown below. I’ve modified the lexer and parser where needed to reflect the update. You can get both here.

Terminal symbols are bolded and single character ones are in double quotes.

NewL Grammar

1. Program ::= MainClass { ClassDecl }
2. MainClass :== class Ident “{public static void main “(” string[] Ident “)” “{” Statement “}” “}
3. ClassDecl ::= class Ident “{” {VarDecl} {MethodDecl} “}
4. VarDecl ::= Type Ident “;
5. MethodDecl ::= public Type Ident “(” FormalList “)” “{” {VarDecl} {Statement} return Exp “;” “}
6. FormalList :== Type Ident { FormalRest } | ε
7. FormalRest :== “,” Type Ident
8. Type :== int | bool | int | string | Ident | IdentArray
9. Statement :== “{” { Statement } “}” | if “(“ Exp “)” Statement else Statement | while(” Exp “)” Statement | System.out.println(” Exp “)” “;” | Ident = Exp “;” | Ident “[" Exp "]” “=” Exp “;
10. Exp ::= Exp Op Exp | Exp “[" Exp "]” | Exp “.length | Exp “.” Ident “(” ExpList “)” |
Integer_Literal | String_Literal | true | false | Ident | this | new Ident “[" Exp "]” | new Ident “(” “)” | “!” Exp | “(” Exp “)
11. ExpList :== Exp { ExpRest } | ε
12. ExpRest ::= “,” Exp
13. Ident ::= Letter { Letter | Digit | “_” }
13. IdentArray ::= Letter { Letter | Digit | “_” }”[""]
14. Integer_Literal ::= Digit { Digit }
15. String_Literal ::= “” { Character } “
16. Op ::= && | < | +| - | *

Revised Hello World for NewL.

class HelloWorld {
    public static void main(string[] a) {
        System.out.println(new Hello().World());
    }
}
class Hello {
    public String World() {
        return "Hello World";
    }
}

With the revised NewL language. We can create the symbol table for the classes. Please note that each ClassSymbol structure itself contains a symbol table for its instance variables and its methods. The below code contains the functions for creating the symbol table of classes. There are several helper functions for creating the symbol table for the methods of each class and the symbol table of the instance variables of each class. Please refer back to the post on “Adding a Symbol Table” for the definition of ClassSymbol and MethodSymbol. With the symbol tables created we will be able to start the type checking.

classSymbols (Program mainClass classDeclList) = classSymbolMainClass mainClass : classSymbolsClassDeclList classDeclList

classSymbolMainClass (MClass className paramName statement) =
                      (className, (ClassSymbol className [] [
                                                            ("main", 
                                                                    (MethodSymbol {returnType = "void", name = "main", args = [("string[]", paramName)]})
                                                            )]
                                  )
                      )
classSymbolClassDecl (ClassDecl className parentClassName varDecls methodDecls) = (className, (ClassSymbol className (varSymbols varDecls) (methodSymbols methodDecls)))
classSymbolsClassDeclList (ClassDeclList classDecl classDeclList) = classSymbolClassDecl classDecl : classSymbolsClassDeclList classDeclList
classSymbolsClassDeclList (CEmpty) = []

varSymbols VEmpty = []
varSymbols (VarDeclList theType ident varDeclList) = varSymbol theType ident : varSymbols varDeclList

varSymbol (TypeString) ident = (ident, "string")
varSymbol (TypeBool) ident = (ident, "bool")
varSymbol (TypeInt) ident = (ident, "int")
varSymbol (TypeIdent identType) ident = (ident, identType)
varSymbol (TypeIdentArray identType) ident = (ident, identType)

methodSymbols MEmpty = []
methodSymbols (MethodDeclList methodDecl methodDeclList) = methodSymbol methodDecl : methodSymbols methodDeclList

methodSymbol (MethodDecl theType ident formalList varDeclList statementList exp)
             = case theType of
                    TypeInt -> (ident, MethodSymbol {returnType = "int", name = ident, args = (argSymbols formalList)})
                    TypeBool -> (ident, MethodSymbol {returnType = "bool", name = ident, args = (argSymbols formalList)})
                    TypeString -> (ident, MethodSymbol {returnType = "string", name = ident, args = (argSymbols formalList)})
                    TypeIdent classType -> (ident, MethodSymbol {returnType = classType, name = ident, args = (argSymbols formalList)})
                    TypeIdentArray classType -> (ident, MethodSymbol {returnType = classType, name = ident, args = (argSymbols formalList)})

argSymbols FEmpty = []
argSymbols (FormalList theType ident formalList) =
           case theType of
                TypeInt -> (ident, "int") : argSymbols formalList
                TypeBool -> (ident, "bool") : argSymbols formalList
                TypeString -> (ident, "string") : argSymbols formalList
                TypeIdent classType -> (ident, classType) : argSymbols formalList
                TypeIdentArray classType -> (ident, classType) : argSymbols formalList

You can download the source code here. Look in the Newl.y file for the parser with the symbol table added.

Adding a Symbol Table (Compiler Series Part VI)

Now that we have a lexer and parser it’s time to add a symbol table for doing type checking. Type checking/Semantic analysis is a huge subject so I’ll be splitting it into several posts.

Our compiler so far consists of the lexer and parser. For a complete compiler we need the following stages (the diagram is taken from figure 1.6 of the second edition of the Dragon Book [1]).

Compiler Stages

We need the beginnings of a symbol table to start the semantic analysis. The symbol table is what it says it is, a table of symbols. For example the following fragment of pseudo code would have the associated symbol tables

                               symbol_table1 = {}
int x;                         symbol_table1 = {"x" : int_type }
int y;                         symbol_table1 = {"x" : int_type, "y" : int_type}
{
    int z                      symbol_table2 = {"z": int_type}, symbol_table1 = {"x" : int_type, "y" : int_type
}

The above symbol tables only track the type of the variables. The code reveals two things, (1) it is necessary to have a stack of symbol tables to track the scope of a variable’s declaration and (2) to get the type of a variable a search of the symbol table stack must be done with the type determined by the first symbol table in the stack to contain a declaration of that variable (this is so that if a variable is shadowed the correct type is returned). Wikipedia has a good article on symbol tables here [2].

For the NewL compiler the construction of the symbol table and the type checking are very entwined. I determined what to put in the symbol table via what was required for type checking. I ended up with the following definition for two symbol types, one for methods and one for classes.

data ClassSymbol 
    = ClassSymbol 
      {
          className       :: String -- class name
        , publicVariables :: [(String, String)] -- (variable name, variable type name) list of variable
        , methods         :: [(String, MethodSymbol)] -- (method name, method symbol) list of methods
      }
      deriving (Eq)

instance Show ClassSymbol where 
    show (ClassSymbol cName vars methods) = show cName
    
data MethodSymbol = MethodSymbol {
      returnType :: String
    , name       :: String
    , args       :: [(String, String)] -- list of arguments
    } 
    deriving (Show, Eq)

The symbol table itself is an association list of variable names and their type names, for instance the following would be the symbol table upon entering the type checking for a class.

[("this", className)]

In my next post I’ll go into how to use the symbol table for some basic type checking.

References
[1] Compilers : Principles, Techniques, and Tools
http://dragonbook.stanford.edu/

[2] Wikipedia : Symbol Table

Not Happy, Happy Documentation (Compiler Series Part V)

After reading about monads in chapter 14 of Real World Haskell I thought I understood them to some extant. Then I read the section of the Happy user guide on threaded lexers and I was still confused on how to create a threaded lexer.

After looking at the threaded lexers section and not understanding it, I looked at the Haskell code that Happy generates when it runs. It turns out that the regular parser generator without a threaded lexer passes the token that caused the parse error as the first token in the token list to the “parseError” function. They DO NOT mention this in the documentation.

To grab the token, line and, column# that caused the parse error use the following function

parseError :: [Token] -> a
parseError tokenList = let pos = tokenPosn(head(tokenList)) 
  in 
  error ("parse error at " ++ show (head(tokenList))++ show(getLineNum(pos)) ++ ":" ++ show(getColumnNum(pos)))

The key point is the call to “head(tokenList)” this grabs the first token in the tokenList which is the token that caused the parse error.

You can download the source code and check it out here. To test the error reporting run “make” and then

bbell@bbell-desktop:~/NewL-Compiler/parser_with_error_reporting$ cat ../test_files/test1.txt  | ./newl 
newl: parse error at line 3 and column 9

Playing With Alex Again (Compiler Series Part IV)

Now that we have a scanner and parser I want to go back and add error reporting to both. For this post I’m only covering error reporting for the scanner. Our goal is simple, we want to abort on the first error the scanner encounters and print out the line and column number of the error.

In my previous post on the scanner I mentioned that we used the “basic” wrapper for Alex which provided a simple interface to the scanner that took a String and outputted a list of tokens.

For error reporting we need something more sophisticated. In particular the “posn” wrapper is almost perfect for our purposes. To use the “posn” wrapper we need to modify our Token type by annotating it with one property of type AlexPosn. The “posn” wrapper defines AlexPosn to contain the following three numbers, the absolute offset of the token, the line number of the token, and the column number of the token. Specifically

data AlexPosn = AlexPn !Int  -- absolute character offset
                       !Int  -- line number
                       !Int  -- column number

For now you can ignore the “!” in the “!Int”, for explanation of what it means go here.

This is exactly what we need for error reporting. To use the “posn” wrapper we have to modify our Token type to include an AlexPosn for representing the token position. For example

data Token =
     	TLeftBrace       |

becomes

data Token =
     	TLeftBrace AlexPosn     |

We also have to modify the scanner actions to include the passed AlexPosn value on each action. I’ve included the modified actions below.

tokens :-

  $white+				;
  "class"				{ \p s -> TClass p }
  "new"					{ \p s -> TNew p }
  "String"				{ \p s -> TString p }
  "static"				{ \p s -> TStatic p }
  "void"				{ \p s -> TVoid p }
  "main"				{ \p s -> TMain p }
  "return"              { \p s -> TReturn p }
  "public"				{ \p s -> TPublic p }
  "extends"				{ \p s -> TExtend p }
  "int"					{ \p s -> TInt p }
  "boolean"				{ \p s -> TBool p }
  "if"					{ \p s -> TIf p }
  "else"				{ \p s -> TElse p }
  "true"				{ \p s -> TTrue p }
  "false"				{ \p s -> TFalse p }
  "this"				{ \p s -> TThis p }
  "length"				{ \p s -> TLength p }
  "while"				{ \p s -> TWhile p }
  $digit+				{ \p s -> TIntLiteral p (read s) }
  "."                   { \p s -> TPeriod p }
  "&&"			        { \p s -> TOp p (head s) }
  "!"					{ \p s -> TNot p }
  [\+\-\*\/]            { \p s -> TOp p (head s) }
  "<"                   { \p s -> TComOp p (head s) }
  "="					{ \p s -> TEquals p }
  ";" 					{ \p s -> TSemiColon p }
  "("					{ \p s -> TLeftParen p }
  ")"					{ \p s -> TRightParen p }
  $alpha[$alpha $digit \_ \']*	{ \p s -> TIdent p s }
  @string 	       	    { \p s -> TStringLiteral p (init (tail s)) -- remove the leading and trailing double quotes }
  "{"	 	 	   		{ \p s -> TLeftBrace p }
  "}"					{ \p s -> TRightBrace p }
  ","					{ \p s -> TComma p }
  "["					{ \p s -> TLeftBrack p }
  "]"					{ \p s -> TRightBrack p }
  "System.out.println"  { \p s -> TPrint p }
-- Each action has type ::AlexPosn -> String -> Token

Picking one specific example.

$digit+				{ \p s -> TIntLiteral p (read s) }

The “\p s -> TIntLiteral p (read s)” is an anonymous function that is passed two arguments, a String, “s”, and an AlexPosn, “p”. The function returns the token “TIntLiteral” with p as the token position and (read s) as the value of the actual integer.

The only issue with the “posn” wrapper is that the “alexScanTokens” function does not include any information when an error is encountered; it simply aborts with the message “lexical error” with no information on the error line or column number. To fix this we define a new function “alexScanTokens2″ that includes better error reporting. The below code defines two helper functions and the modified “alexScanTokens2″.

getLineNum :: AlexPosn -> Int
getLineNum (AlexPn offset lineNum colNum) = lineNum 

getColumnNum :: AlexPosn -> Int
getColumnNum (AlexPn offset lineNum colNum) = colNum

--alexScanTokens :: String -> [token]
alexScanTokens2 str = go (alexStartPos,'\n',str)
  where go inp@(pos,_,str) =
          case alexScan inp 0 of
                AlexEOF -> []
                AlexError _ -> error ("lexical error @ line " ++ show (getLineNum(pos)) ++ " and column " ++ show (getColumnNum(pos)))
                AlexSkip  inp' len     -> go inp'
                AlexToken inp' len act -> act pos (take len str) : go inp'

main = do
  s <- getContents
  print (alexScanTokens2 s)

The definition of “alexScanTokens2″ may look scary at first if you’re a new Haskell programmer like me but if we break it down, it’s not so complicated.
First the two lines

alexScanTokens2 str = go (alexStartPos,'\n',str)
  where go inp@(pos,_,str) =

define “alexScanTokens2″ to take a String, “str”, as input and defines it to return the value of “go (alexStartPos, ‘\n’, str)”.
The where clause defines what “go (alexStartPost, ‘\n’, str)” is. Which in this case it is

go inp@(pos,_,str)

Please note that the “inp@” gives “inp” the value of “(pos, _, str)”. That is “inp” is defined as a three-tuple. Saying it another way, the “@” is syntactic sugar for defining the variable “inp” as the value of “(pos, _, str)”. To figure out the case expression I recommend looking up the definition of “alexScan”, here, it’s under the “Basic Interface” section.

The only portion of the case expression that I modified from the default “alexScanTokens” function provided by the “posn” wrapper was to change the AlexError case to

AlexError _ -> error ("lexical error @ line " ++ show (getLineNum(pos)) ++ " and column " ++ show (getColumnNum(pos)))

As the code self evidently shows, a simple message is printed giving the line and column number of the error and the program then aborts.

To try out our new scanner you can download it at
http://github.com/bjwbell/NewL-Compiler
under the “scanner_with_error_reporting” directory. There’s a small readme file with instructions on how to test the scanner.

Having Fun with Happy (Compiler Series Part III)

The next step to writing a compiler is creating the parser. For the parser generator I’m using Happy which is the Haskell version of yacc.

To make following along with me easier I’ve uploaded the parser and other files for NewL to github.com at
http://github.com/bjwbell/NewL-Compiler
. The parser is in the parser directory. Please note that the parser contains no error checking. I’ll explain how to add error checking in a later blog post.

To play with the parser and test it go to
http://github.com/bjwbell/NewL-Compiler/archives/master
. It’ll ask you to download the source. Once you’ve downloaded the source and unzipped it, open a terminal and go to the parser directory.

To compile the parser you’ll need to have the Glasgow Haskell Compiler collection installed. You also need to have Happy and Alex installed. If you’re running Ubuntu all of them are available via the Ubuntu repositories. To compile the parser run

make

and to test the parser run

cat test1.newl | ./newl

It will output the parse tree representation of the test1.newl program.

Before going into how I wrote the parser I want to mention that to get the scanner and parser to play together you have to delete

main = do
  s <- getContents
  print (alexScanTokens s)

from scanner.x. Otherwise Alex would generate a standalone scanner.

For an explanation of how the parts to the parser fit together I strongly encourage you to read
http://www.haskell.org/happy/doc/html/sec-using.html
. It has a very well written explanation of writing a parser using Happy.

The first part of the parser is to declare the name and what modules we import.

{
module Main where
import Scanner
}

The important part is “import Scanner”. This imports the scanner that was declared in scanner.x. In particular notice that the name “Scanner” matches the module name declared in scanner.x. The second part is to declare the name of our parser, you can choose any name you like. We also declare the token type which must match the token type declared in scanner.x, finally we give the name of the error function.

%name newl
%tokentype { Token }
%error { parseError }

Now onto to the meat of the parser. We declare a list of tokens as I’ve done below.

%token
  "class"                      { TClass }
  "new"                        { TNew }
  "String"                     { TString }
  "static"                     { TStatic }
  "void"                       { TVoid }
  "main"                       { TMain }
  "public"                     { TPublic }
  "return"                     { TReturn }
  "extends"                    { TExtend }
  "int"	                       { TInt }
  "boolean"                    { TBool }
  "if"                         { TIf }
  "else"                       { TElse }
  "true"                       { TTrue }
  "false"                      { TFalse }
  "this"                       { TThis }
  "length"                     { TLength }
  "while"                      { TWhile }
  integer_literal              { TIntLiteral $$ }
  ident                        { TIdent $$ }
  "{"                          { TLeftBrace }
  "}"                          { TRightBrace }
  ","                          { TComma }
  "["                          { TLeftBrack }
  "]"				           { TRightBrack }
  op                           { TOp $$}
  comop                        { TComOp $$ }
  "("                          { TLeftParen }
  ")"                          { TRightParen }
  ";"                          { TSemiColon }
  "."                          { TPeriod }
  "!"                          { TNot }
  "="                          { TEquals }
  "System.out.println"         { TPrint }
%%

“The symbols on the left are the tokens as they will be referred to in the rest of the grammar, and to the right of each token enclosed in braces is a Haskell pattern that matches the token. The parser will expect to receive a stream of tokens, each of which will match one of the given patterns .” [1] (the definition of the Token datatype is given in scanner.x)

Now we declare the production rules to the grammar (to view the grammar in BNF notation see
http://bloggingmath.wordpress.com/2010/01/20/writing-a-compiler-in-haskell-compiler-series-part-i/
).

Program : 
        MainClass ClassDeclList { Program $1 $2 }
MainClass : 
          "class" ident "{" "public" "static" "void" "main" "(" "String" "[" "]" ident ")" "{" Statement "}" "}" { MClass $2 $12 $15 }


ClassDeclList :
          ClassDecl     { ClassDeclList $1 CEmpty }
          | ClassDecl ClassDeclList { ClassDeclList $1 $2 }
          |             { CEmpty }

ClassDecl : 
          "class" ident "{" VarDeclList MethodDeclList "}"                     { ClassDecl $2 "void" $4 $5 }
          | "class" ident "extends" ident "{" VarDeclList MethodDeclList "}"   { ClassDecl $2 $4 $6 $7 }


MethodDeclList :
     MethodDecl                   { MethodDeclList $1 MEmpty }
     | MethodDecl MethodDeclList  { MethodDeclList $1 $2 }
     |                            { MEmpty }

MethodDecl : 
     "public" Type ident "(" FormalList ")" "{" VarDeclList StatementList "return" Exp ";" "}" { MethodDecl $2 $3 $5 $8 $9 $11 }

VarDeclList :
     Type ident ";" { VarDeclList $1 $2 VEmpty }
     | Type ident ";" VarDeclList { VarDeclList $1 $2 $4 }
     |              { VEmpty }

FormalList :
     Type ident       { FormalList $1 $2 FEmpty }
     | Type ident FormalList { FormalList $1 $2 $3 }

Type :
     "int" "[" "]"    { TypeIntArray }
     | "boolean"      { TypeBoolean }
     | "int"          { TypeInt }
     | ident          { TypeIdent $1 }

Statement :
    "{" StatementList "}"                            { SList $2 }
    | "if" "(" Exp ")" Statement "else" Statement  { SIfElse $3 $5 $7 }
    | "while" "(" Exp ")" Statement                { SWhile $3 $5 }
    | "System.out.println" "(" Exp ")" ";"         { SPrint $3 }
    | ident "=" Exp ";"                              { SEqual $1 $3 }
    | ident "[" Exp "]" "=" Exp ";"                  { SArrayEqual $1 $3 $6 }

StatementList :
    Statement               { StatementList Empty $1 }
    | StatementList Statement   { StatementList $1 $2 }

Exp : 
    Exp op Exp                        { ExpOp $1 $2 $3}
    | Exp comop Exp                   { ExpComOp $1 $2 $3}
    | Exp "[" Exp "]"                 { ExpArray $1 $3}
    | Exp "." "length"                { ExpLength $1}
    | Exp "." ident "(" ExpList ")"   { ExpFCall $1 $3 $5}
    | integer_literal                 { ExpInt $1}
    | "true"                          { ExpBool True}
    | "false"                         { ExpBool False}
    | ident                           { ExpIdent $1}
    | "this"                          { ExpThis }
    | "new" "int" "[" Exp "]"         { ExpNewInt $4 }  
    | "new" ident "(" ")"             { ExpNewIdent $2}
    | "!" Exp                         { ExpNot $2}
    | "(" Exp ")"                     { ExpExp $2}

ExpList :
        Exp            { ExpListExp $1 }
        | Exp ExpRest  { ExpList $1 $2 }
        |              { ExpListEmpty }

ExpRest : 
     "," Exp      { ExpRest $2 }

“Each production consists of a non-terminal symbol on the left, followed by a colon, followed by one or more expansions on the right, separated by |. Each expansion has some Haskell code associated with it, enclosed in braces as usual.” [1]

“The way to think about a parser is with each symbol having a `value’: we defined the values of the tokens above, and the grammar defines the values of non-terminal symbols in terms of sequences of other symbols (either tokens or non-terminals).” [1]

With production rules declared we now declare the data types for them. Which I’ve listed below.

{
parseError :: [Token] -> a
parseError _ = error "Parse error"


data Program 
    = Program MainClass ClassDeclList
      deriving (Show, Eq)



data MainClass
    = MClass String String Statement
      deriving (Show, Eq)

data ClassDeclList
    = ClassDeclList ClassDecl ClassDeclList
    | CEmpty
  deriving (Show, Eq)

data ClassDecl = ClassDecl Ident Ident VarDeclList MethodDeclList
  deriving (Show, Eq)


data MethodDeclList
    = MethodDeclList MethodDecl MethodDeclList
    | MEmpty
    deriving (Show, Eq)
data MethodDecl
    = MethodDecl Type Ident FormalList VarDeclList StatementList Exp
    deriving (Show, Eq)

data VarDeclList =
    VarDeclList Type Ident VarDeclList
    | VEmpty
    deriving (Show, Eq)

data FormalList = 
    FormalList Type Ident FormalList
    | FEmpty
  deriving (Show, Eq)

data Type =
    TypeIntArray
    | TypeBoolean
    | TypeInt
    | TypeIdent Ident
    deriving (Show, Eq)

data Statement
    = Statement String
    | SList StatementList
    | SIfElse Exp Statement Statement
    | SWhile Exp Statement
    | SPrint Exp
    | SEqual Ident Exp
    | SArrayEqual Ident Exp Exp
    | StatementError
    deriving (Show, Eq)

data StatementList
    = StatementList StatementList Statement 
    | Empty
    deriving (Show, Eq)


data Exp
    = Exp String
    | ExpOp Exp Char Exp
    | ExpComOp Exp Char Exp
    | ExpArray Exp Exp -- "Exp [ Exp ]"
    | ExpFCall Exp Ident ExpList -- Exp . Ident ( ExpList )
    | ExpInt Int
    | ExpNewInt Exp
    | ExpBool Bool -- True or False
    | ExpIdent Ident
    | ExpNewIdent Ident -- new Ident ()
    | ExpExp Exp -- Exp ( Exp )
    | ExpThis
    | ExpNot Exp
    | ExpLength Exp
    | ExpError
    deriving (Show, Eq)

data Op
     = And
     | LessThan
     | Plus
     | Minus
     | Times
     deriving (Show, Eq)

type Ident = String
type Integer_Literal = Int
data ExpList
    = ExpList Exp ExpRest
    | ExpListEmpty
    | ExpListExp Exp
    deriving (Show, Eq)
data ExpRest
    = ExpRest Exp
    deriving (Show, Eq)



main = do 
  inStr <- getContents
  let parseTree = newl (alexScanTokens inStr)  
  putStrLn ("parseTree:" ++ show(parseTree))
  print "done"
}

I’ve also defined the “parseError” function and the main function.
The main function uses “getContents” to retrieve the input from stdin and then creates the parse tree using “newl (alexScanTokens inStr)”. Note the “newl” name is the same one we declared using “%name newl” at the beginning of the program.

With the above code we now have a working parser for the NewL language. At this point if there’s an error parsing it barfs out “parse error” and doesn’t tell us anything about the error. Also no type or semantic analysis is done. I’ll expand on both of these points in later posts.

References
[1] Chapter 2. Using Happy,
http://www.haskell.org/happy/doc/html/sec-using.html

Playing with Alex (Compiler Series Part II)

We’re going to use Alex to write the scanner for NewL. Alex is the Haskell equivalent of Lex. The Alex manual is reasonably thorough and has several excellent examples of simple scanners written using it.

The basic idea for Alex is to pass it your Haskell code written up in the special Alex syntax and it spits out a plain Haskell source code file. There are several wrapper options available for using Alex as (1) a standalone scanner or (2) a module that gets compiled in with your parser. For this post we’ll be using Alex as a standalone scanner that takes input from standard in and outputs a list of tokens.

Alex like all other scanner generators uses regular expressions for specifying tokens. The basic format of an Alex file is

regexp { haskell-code }  

where haskell-code is an anonymous function that takes a string (the string that matched the regexp) and returns a token. A typically example is

 [0-9]+ { \s -> TIntLiteral s }

The \s is Haskell’s way of defining an anonymous function with one parameter named s. The “->” signals the start of the function body. Thus “\s -> TIntLiteral s” is a function that takes a String s as the argument and returns TIntLiteral s.

To define our tokens we make a new data type called Token with a list of the possible tokens.

 data Token = TIntLiteral String |
              TIdentifier String 

The above code defines two token types one is a literal int such as “10″ and the other is an identifier such as “myVariable”.

Alex has a nice option %wrapper "basic" that defines the function alexScanTokens which takes a string and returns the list of tokens. Using this we define our main function as

main = do 
           s <- getContents
           print (alexScanTokens s) 

The main function gets the string from the standard input and puts it in s and then uses print (alexScanTokens s) to generate the list of tokens and print them out.
To test the scanner run

echo "10 \"StringLiteral\"" | ./a.out

the above command echos 10 “StringLiteral” to the standard output and then pipes it into the scanner, a.out (I assume a.out is the scanner executable). The output should be [TIntLiteral 10, TStringLiteral "StringLiteral"].

I’ve included a listing of the NewL Alex scanner below. I strongly recommend reading the Alex manual introduction. For compiling the listing use “alex Scanner.x” to generate the Haskell code and then run “ghc Scanner.hs” to generate the executable.


{
module Main (main) where
}

%wrapper "basic"

$digit = 0-9			-- digits
$alpha = [a-zA-Z]		-- alphabetic characters
$graphic    = $printable # $white

@string     = \" ($graphic # \")* \"



tokens :-

  $white+				;
  "class"				{ \s -> TClass }
  "new"				    { \s -> TNew }
  "String"				{ \s -> TString }
  "static"				{ \s -> TStatic }
  "void"				{ \s -> TVoid }
  "main"				{ \s -> TMain }
  "return"              { \s -> TReturn }
  "public"				{ \s -> TPublic }
  "extends"				{ \s -> TExtend }
  "int"					{ \s -> TInt }
  "boolean"				{ \s -> TBool }
  "if"					{ \s -> TIf }
  "else"				{ \s -> TElse }
  "true"				{ \s -> TTrue }
  "false"				{ \s -> TFalse }
  "this"				{ \s -> TThis }
  "length"				{ \s -> TLength }
  "while"				{ \s -> TWhile }
  $digit+				{ \s -> TIntLiteral (read s) }
  "."                   { \s -> TPeriod }
  "&&"				    { \s -> TOp (head s) }
  "!"					{ \s -> TNot }
  [\+\-\*\/]            { \s -> TOp (head s) }
  "<"                   { \s -> TComOp (head s) }
  "="					{ \s -> TEquals }
  ";" 					{ \s -> TSemiColon }
  "("					{ \s -> TLeftParen }
  ")"					{ \s -> TRightParen }
  $alpha[$alpha $digit \_ \']*	{ \s -> TIdent s }
  @string 	       	  		{ \s -> TStringLiteral (init (tail s)) -- remove the leading " and trailing " }
  "{"	 	 	   		{ \s -> TLeftBrace }
  "}"					{ \s -> TRightBrace }
  ","					{ \s -> TComma }
  "["					{ \s -> TLeftBrack }
  "]"					{ \s -> TRightBrack }
  "System.out.println"  { \s -> TPrint }
{
-- Each action has type :: String -> Token

-- The token type:
data Token =
    TLeftBrace  |
	TRightBrace |
	TComma |
	TLeftBrack |
	TRightBrack |
	TClass |
	TPublic |
	TString |
	TStatic |
	TVoid |
	TMain	 |
	TExtend |
	TInt |
	TBool |
	TIf |
	TElse |
	TTrue |
	TFalse |
	TThis |
	TLength |
	TWhile |
	TNew |
	TOp Char |
	TComOp Char |
    TNot |
	TEquals |
	TPeriod |
	TSemiColon |
	TLeftParen |
	TRightParen |
	TIdent String |
    TPrint |
	TIntLiteral Int |
	TStringLiteral String |
    TReturn
	deriving (Eq,Show)

main = do
  s <- getContents
  print (alexScanTokens s)

}

Writing A Compiler In Haskell (Compiler Series Part I)

My goal is to write a series of posts on creating a compiler for a restricted subset, NewL, of the Java language in Haskell. My tools are Haskell, Alex for scanning, and Happy for parsing.

Alex and Happy are basically the Haskell equivalent of Lex and Yacc. I know most of my compiler theory and practices from the Dragon book so I’ll be making references to it for the techniques used in semantic analysis and code generation. I’ve learned everything I know about Haskell from the really good book “Real World Haskell“, I cannot recommend it highly enough.

I plan to cover the following topics with Haskell source code and make files included.

1. Go over the grammar of NewL (this post).
2. Scanner for NewL using Alex.
3. Parser for NewL using Happy.
4. Error recover and reporting for the scanner and parser (I’m keeping it simple and abort on the first error and print the line and column number)
5. Symbol table and how it’s used in semantic analysis and code generation.
6. Revised NewL language and more symbol tables.
7. Type checking.
8. Intermediate code generation.
9. Machine code generation.
10. Garbage collection.

As you can see that’s a lot of stuff to cover. I’m warning you ahead of time that I’m very much a newbie when it comes to writing Haskell. I don’t even understand monads, so be forewarned the code get ugly compared to a true Haskell programmer.

With the above out of the way I’d like to go over the grammar of NewL. Please note that the semantics of NewL are similar to Java so if there’s any question about what some operation should do, we follow what Java does.

Terminal symbols are bolded and single character ones are in double quotes.

NewL Grammar

1. Program ::= MainClass { ClassDecl }
2. MainClass :== class Ident “{public static void main “(” String “[" "]“ Ident “)” “{” Statement “}” “}
3. ClassDecl ::= class Ident “{” {VarDecl} {MethodDecl} “}
4. VarDecl ::= Type Ident “;
5. MethodDecl ::= public Type Ident “(” FormalList “)” “{” {VarDecl} {Statement} return Exp “;” “}
6. FormalList :== Type Ident { FormalRest } | ε
7. FormalRest :== “,” Type Ident
8. Type :== int “[" "]“ | boolean | int | String | Ident
9. Statement :== “{” { Statement } “}” | if “(“ Exp “)” Statement else Statement | while(” Exp “)” Statement | System.out.println(” Exp “)” “;” | Ident = Exp “;” | Ident “[" Exp "]” “=” Exp “;
10. Exp ::= Exp Op Exp | Exp “[" Exp "]” | Exp “.length | Exp “.” Ident “(” ExpList “)” |
Integer_Literal | String_Literal | true | false | Ident | this | new int[" Exp "]” | new Ident “(” “)” | “!” Exp | “(” Exp “)
11. ExpList :== Exp { ExpRest } | ε
12. ExpRest ::= “,” Exp
13. Ident ::= Letter { Letter | Digit | “_” }
14. Integer_Literal ::= Digit { Digit }
15. String_Literal ::= “” { Character } “
16. Op ::= && | < | +| - | *

The classic Hello World for NewL

class HelloWorld {
    public static void main(String[] a) {
        System.out.println(new Hello().World());
    }
}
class Hello {
    public String World() {
        return "Hello World";
    }
}