Title: A (Long) Introduction to AntLR
1 A (Long) Introduction to AntLR
- Slides adapted from
- AntLR Reference Manual by Terence Parr
- antlr.org/share/1084743321127/ANTLR_Reference_Manual.pdf
- AntLR Tutorial by Ashley J.S Mills
- http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlrhome.html
- An Introduction to AntLR by Terence Parr
- http://www.cs.usfca.edu/parrt/course/652/lectures/antlr.html
- An AntLR Tutorial by Scott Stanchfield
- javadude.com/articles/antlrtut/
2 AntLR
- ANother Tool for Language Recognition
- (or anti-LR??)
- an LL(k) parser and translator generator tool
- which can create
- lexers
- parsers
- abstract syntax trees (ASTs)
- in which you describe the language grammatically
- and in return receive a program that can recognize and translate that language
3 Tasks Divided
- Lexical Analysis (scanning)
- Syntactic Analysis (parsing)
- Tree Generation
- Code Generation
4 Lexer
- A source file is streamed to a lexer character by character by some kind of input interface.
- The lexer groups characters into tokens that are meaningful to the parser.
- A token may be
- a keyword
- an identifier
- a symbol
- an operator
- The lexer also removes comments and whitespace from the program, which are meaningless to the parser.
- So it creates a stream of tokens, which the parser receives one by one.
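To make the lexer's job concrete, here is a minimal hand-written sketch in Java. It is not the code AntLR generates; the class name `SketchLexer`, the token spellings such as `ID(x)`, and the choice of `//` as the comment syntax are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written lexer; NOT the code AntLR generates.
class SketchLexer {
    // Splits the input into token strings, discarding whitespace and "//" comments.
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace never reaches the parser
            } else if (c == '/' && i + 1 < src.length() && src.charAt(i + 1) == '/') {
                while (i < src.length() && src.charAt(i) != '\n') i++;   // skip the comment
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add("ID(" + src.substring(start, i) + ")");
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add("INT(" + src.substring(start, i) + ")");
            } else {
                tokens.add("OP(" + c + ")");           // single-character symbol or operator
                i++;
            }
        }
        return tokens;
    }
}
```

Calling `SketchLexer.tokenize("x = 3 + 42 // sum")` yields five tokens; the spaces and the trailing comment are consumed but never emitted.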
5 Parser
- The parser organizes the tokens into the allowed sequences defined by the grammar of the language.
- If the parser encounters a sequence of tokens that matches none of the allowed sequences, it issues an error.
- A design choice is whether to try to recover from the error by making assumptions.
- Parsers may either do syntax-directed translation on the fly,
- or convert the sequences of tokens into an Abstract Syntax Tree (AST).
- An AST is a structure which
- keeps information in an easily traversable form (such as an operator at a node, operands at the children of the node)
- ignores form-dependent superficial details
- More on ASTs later...
- The parser also generates one or more symbol tables, which contain information about the tokens it encounters.
6 What does a grammar file look like?
- It is composed of rules
- ANTLR accepts three types of grammar specifications
- parsers
- lexers
- tree-parsers (also called tree-walkers)
- Uses LL(k) analysis for all
- So the grammar specifications are similar, and the generated lexers and parsers behave similarly
7 Sample File
- taken from AntLR tutorial of Ashley J.S Mills
8 Sample File Divided (1/3)
- An arbitrary number of parsers, lexers, and tree-parsers may appear in a grammar file
- a separate class file will be generated for each
- i.e., YourLexerClass.class, YourParserClass.class, YourTreeParserClass.class
- Header
- holds a preamble that will be put on top of each of these classes
- an import, maybe?
9 Sample File Divided (2/3)
- Options
- file-wide
- charVocabulary = '\0'..'\377'; // defines the alphabet (used in complement and wildcard)
- k = 2; // means two characters of lookahead
- Class specific
- { ... header for parser class only ... }
- class MyParser extends Parser;
- options { ...parser options... }
- { ... parser class members ... }
- ... parser rules ...
10 Sample File Divided (3/3)
- taken from AntLR tutorial of Ashley J.S Mills
- You simply list a set of lexical rules that match tokens.
- The tool automatically generates code to map the next input character(s) to a rule likely to match: a big "switch" that routes recognition flow to the appropriate rule.
11 Symbols in AntLR
- taken from AntLR reference manual
12 Lexer
- taken from AntLR tutorial of Ashley J.S Mills
- With one restriction
- Rules defined within a lexer grammar must have a name beginning with an uppercase letter
13Lexer Rules
You can define operators like BECOMES
COLON SEMI EQUALS
LBRACKET RBRACKET LPAREN
( RPAREN ) LT lt LTE
lt PLUS MINUS - TIMES
DIV / And then you can
define a token class such as OPS (PLUS MINUS
MULT DIV)
14 Actions
- Blocks of source code (expressed in the target language) enclosed in curly braces
- Executed
- after the preceding production element has been recognized
- before the recognition of the following element
- Typically used to generate output, construct trees, or modify a symbol table
- Position dictates when an action is executed relative to the surrounding grammar elements.
- If it is the first element of a production, it is executed before any other element in that production, but only if that production is predicted by the lookahead
    rule_name
        :   (   {init-action}:
                {action of 1st production} production_1
            |   {action of 2nd production} production_2
            )?
        ;
15 Tip: Skipping Tokens
- Whitespace has no role in the grammar
    WS  :   ( ' ' | '\n' | '\t' )
            { $setType(Token.SKIP); }   // <- action
        ;
- The action means: do not pass this token to the parser. Recognize it and then throw it away.
- Same for comments :)
16 Tip: Newline Stuff
- The line number of the input is used for reporting errors
- It must be incremented by hand when the lexer encounters a newline
    WS  :   (   ' '
            |   '\t'
            |   '\f'
            // handle newlines
            |   (   "\r\n"  // DOS/Windows
                |   '\r'    // Macintosh
                |   '\n'    // Unix
                )
                // increment the line count
                { newline(); }  // <- action executed only in this case
            )
            { $setType(Token.SKIP); }
        ;
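The line-counting logic of the WS rule can be sketched in plain Java as follows. This is an illustration only: `LineCounter` and `countLines` are hypothetical names, and the real generated lexer tracks the line through its own `newline()` call rather than a helper like this.

```java
// Hypothetical helper mirroring the WS rule: count lines by hand while scanning.
class LineCounter {
    // Returns the 1-based line number reached after consuming all of src.
    static int countLines(String src) {
        int line = 1;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\r') {
                // "\r\n" (DOS/Windows) counts as ONE newline; a lone '\r' is old Macintosh.
                if (i + 1 < src.length() && src.charAt(i + 1) == '\n') i++;
                line++;
            } else if (c == '\n') {
                line++;                                // Unix
            }
            i++;
        }
        return line;
    }
}
```

The point of testing "\r\n" before '\r' is the same as in the grammar: otherwise a Windows line ending would be counted twice.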
17 Parser
    class ExprParser extends Parser;

    expr  : mexpr ((PLUS | MINUS) mexpr)* ;
    mexpr : atom (STAR atom)* ;
    atom  : INT
          | LPAREN expr RPAREN
          ;
- Rules defined within a parser grammar must have a name beginning with a lowercase letter
18 Tip: Keywords and Literals (1/2)
- Many languages have a general "identifier" lexical rule, and keywords that are special cases of the identifier pattern
- A typical identifier token may be defined as
    ID : LETTER (LETTER | DIGIT)* ;
- So how can AntLR tell that "if" is not an identifier?
- You put fixed keywords into a literals table.
- checked after each token is matched
- Any double-quoted string used in a parser is automatically entered into the literals table of the associated lexer.
    subprogramBody
        :   (basicDecl)*
            (procedureDecl)*
            "begin"
            (statement)*
            "end" IDENT
        ;
19 Tip: Keywords and Literals (2/2)
- option testLiterals
- By default, ANTLR will generate code in all lexer rules to test each token against the literals table
- However, you may suppress this code generation in the lexer by using a grammar option
    class L extends Lexer;
    options { testLiterals=false; }
    ...
- If you turn this option off for a lexer, you may re-enable it for specific rules
    ID
    options { testLiterals=true; }
        :   LETTER (LETTER | DIGIT)*
        ;
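The literals-table check can be pictured as a map lookup performed after the identifier pattern has matched. The sketch below is an assumption-laden illustration, not AntLR's generated code: the class name, the method `testLiterals`, and the token type numbers are all made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a literals-table check; token type numbers are invented.
class LiteralsSketch {
    static final int IDENT = 1, LITERAL_begin = 2, LITERAL_end = 3;
    static final Map<String, Integer> literals = new HashMap<>();
    static {
        // Double-quoted strings used in the parser would end up here.
        literals.put("begin", LITERAL_begin);
        literals.put("end", LITERAL_end);
    }

    // Called after the identifier rule has matched `text`:
    // keywords get their own token type, everything else stays IDENT.
    static int testLiterals(String text) {
        Integer type = literals.get(text);
        return type != null ? type : IDENT;
    }
}
```

Because the lookup happens on the complete matched text, `beginner` stays an identifier even though it starts with a keyword.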
20 Tip: Token Object Creation
- You will sometimes want to access information about the token being matched
- Label lexical rule references and obtain a Token object representing the text, token type, line number, etc. matched for that rule reference
- Lexer rule
    INT :   ('0'..'9')+ ;
- Lexer rule that labels a reference to INT
    INDEX
        :   '[' i:INT ']'
            { System.out.println(i.getText()); }
        ;
21 Tip: Syntactic / Semantic Predicates
- There are other situations where you have to turn certain rules on and off
- depending on prior context or semantic information
- Use predicates to decide
22 Syntactic Predicates
- ANTLR (tree) parsers usually use only a single symbol of lookahead, which is normally not a problem as intermediate forms are explicitly designed to be easy to walk
- However, there is occasionally the need to distinguish between similar tree structures
- Syntactic predicates can be used to overcome the limitations of limited fixed lookahead
- For example, distinguishing between the unary and binary minus operator
    expr:   ( #(MINUS expr expr) )=> #( MINUS expr expr )
        |   #( MINUS expr )
        ...
        ;
- The order of evaluation is very important as the second alternative is a "subset" of the first alternative
- Syntactic predicates are a form of selective backtracking and, therefore, actions are turned off while evaluating a syntactic predicate so that actions do not have to be undone
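The idea of testing the larger pattern first, without side effects, can be sketched on a hand-built tree. Everything here is hypothetical (the `Node` class, `isBinaryMinus`, and the evaluation of the result); a real syntactic predicate re-runs the matching code with actions disabled rather than calling a hand-written check.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of predicate-guarded alternatives on a hand-built tree.
class MinusSketch {
    static class Node {
        final String text;
        final List<Node> kids;
        Node(String text, Node... kids) { this.text = text; this.kids = Arrays.asList(kids); }
    }

    // The "predicate": a side-effect-free test of whether the binary pattern fits.
    static boolean isBinaryMinus(Node t) {
        return t.text.equals("-") && t.kids.size() == 2;
    }

    static int expr(Node t) {
        if (isBinaryMinus(t)) {                          // first alternative, tried first
            return expr(t.kids.get(0)) - expr(t.kids.get(1));
        }
        if (t.text.equals("-") && t.kids.size() == 1) {  // second alternative: unary minus
            return -expr(t.kids.get(0));
        }
        return Integer.parseInt(t.text);                 // INT leaf
    }
}
```

With one symbol of lookahead both alternatives start with the same MINUS node, so the decision has to look at the children; that is the job the predicate does here.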
23 Semantic Predicates
- Semantic predicates
- at the start of an alternative decide whether or not to match
- in the middle of productions throw exceptions when they evaluate to false
    stat:   {isTypeName(LT(1))}? ID ID ";"    // declaration "type varName;"
        |   ID "=" expr ";"                   // assignment
        ;

    decl:   "var" ID ":" t:ID
            { isTypeName(t.getText()) }?      // used to throw an exception
        ;
24 Eg: Keeping State Information
- Context-sensitive recognition example
- If you are matching tokens that separate rows of data such as "----", you probably only want to match this if the "begin table" sequence has been found
    BEGIN_TABLE
        :   '[' { this.inTable=true; }    // enter table context
        ;
    ROW_SEP
        :   { this.inTable }? "----"      // semantic predicate
        ;
    END_TABLE
        :   ']' { this.inTable=false; }   // exit table context
        ;
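A hand-written Java sketch of the same context-sensitive behaviour follows. It is only an approximation of what the three rules express: `TableLexerSketch`, the token spellings, and the fallback `CHAR(...)` token are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical lexer sketch: "----" is a ROW_SEP only between '[' and ']'.
class TableLexerSketch {
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        boolean inTable = false;                       // the state the predicate consults
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '[') {
                inTable = true;                        // enter table context
                tokens.add("BEGIN_TABLE");
                i++;
            } else if (c == ']') {
                inTable = false;                       // exit table context
                tokens.add("END_TABLE");
                i++;
            } else if (inTable && src.startsWith("----", i)) {
                tokens.add("ROW_SEP");                 // predicate held, so the rule applies
                i += 4;
            } else {
                tokens.add("CHAR(" + c + ")");         // anything else, one character at a time
                i++;
            }
        }
        return tokens;
    }
}
```

The same four dashes produce different tokens depending on whether a '[' has been seen, which is exactly what a grammar without state cannot express.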
25 The Java Code
- The code to invoke the parser
    import java.io.*;

    class Main {
        public static void main(String[] args) {
            try {
                // use DataInputStream to grab bytes
                MyLexer lexer = new MyLexer(new DataInputStream(System.in));
                MyParser parser = new MyParser(lexer);
                int x = parser.expr();
                System.out.println(x);
            } catch (Exception e) {
                System.err.println("exception: " + e);
            }
        }
    }
26 Running AntLR
- In Linux
    runantlr <antlr_file>.g
    javac *.java
    java Main
- In Windows
- Eclipse has a very easy-to-use plugin for AntLR
- See http://antlreclipse.sourceforge.net/ for very detailed instructions
- The plugin will run AntLR on the grammar file
27 Expression Evaluation 1: Syntax-Directed Translation
- To evaluate the expressions on the fly as the tokens come in, add actions to the parser
    class ExprParser extends Parser;

    expr returns [int value=0]
    { int x; }
        :   value=mexpr
            (   PLUS x=mexpr  { value += x; }
            |   MINUS x=mexpr { value -= x; }
            )*
        ;

    mexpr returns [int value=0]
    { int x; }
        :   value=atom
            ( STAR x=atom { value *= x; } )*
        ;

    atom returns [int value=0]
        :   i:INT { value=Integer.parseInt(i.getText()); }
        |   LPAREN value=expr RPAREN
        ;
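For readers who want to see roughly what such an annotated grammar turns into, here is a hand-written recursive-descent evaluator with one method per rule. It is a sketch under stated assumptions, not AntLR output: it reads characters directly instead of tokens, accepts no whitespace, and the class name `ExprEval` is hypothetical.

```java
// Hypothetical hand-written equivalent of the expr/mexpr/atom rules; the input is
// assumed to contain only digits, '+', '-', '*', '(' and ')' with no whitespace.
class ExprEval {
    private final String src;
    private int pos = 0;

    ExprEval(String src) { this.src = src; }

    static int eval(String src) {
        ExprEval p = new ExprEval(src);
        int value = p.expr();
        if (p.pos != src.length()) throw new IllegalArgumentException("unexpected input at " + p.pos);
        return value;
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // expr : mexpr ((PLUS | MINUS) mexpr)*
    private int expr() {
        int value = mexpr();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            int x = mexpr();
            if (op == '+') value += x; else value -= x;   // the grammar's actions
        }
        return value;
    }

    // mexpr : atom (STAR atom)*
    private int mexpr() {
        int value = atom();
        while (peek() == '*') {
            pos++;
            value *= atom();
        }
        return value;
    }

    // atom : INT | LPAREN expr RPAREN
    private int atom() {
        if (peek() == '(') {
            pos++;
            int value = expr();
            if (peek() != ')') throw new IllegalArgumentException("expected ')' at " + pos);
            pos++;
            return value;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("expected INT at " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }
}
```

Precedence falls out of the rule nesting: `expr` calls `mexpr`, so multiplication binds tighter than addition without any explicit precedence table.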
28 Expression Evaluation 2: via AST Intermediate Form
- A more powerful strategy than syntax-directed translation is to build an AST
- an intermediate representation that holds all or most of the input symbols and has encoded, in the structure of the data, the relationship between those tokens
- For this kind of tree, you will use a tree walker to compute the same values as before, but using a different strategy
- The utility of ASTs becomes clear when you must do multiple walks over the tree to figure out what to compute or to do tree rewrites, morphing the tree towards another language.
29 Abstract Syntax Trees
- Abstract Syntax Tree: like a parse tree, without unnecessary information
- Two-dimensional trees that can encode the structure of the input as well as the input symbols
- Either
- homogeneous: all objects of the same type, e.g., CommonAST in ANTLR
- or heterogeneous: multiple types such as PlusNode, MultNode...
- An AST for (3+4) might be represented as a PLUS node with the children 3 and 4
- No parentheses are included in the tree!
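A homogeneous tree can be sketched with a single node class. The `AstNode` class below is hypothetical and much simpler than `antlr.CommonAST`; the LISP-style output format is chosen for illustration and is not claimed to match AntLR's exact spacing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical homogeneous AST node: one class for every node; NOT antlr.CommonAST.
class AstNode {
    final String text;
    final List<AstNode> children = new ArrayList<>();

    AstNode(String text, AstNode... kids) {
        this.text = text;
        children.addAll(Arrays.asList(kids));
    }

    // LISP-style rendering: "( root child1 child2 )" for inner nodes, plain text for leaves.
    String toLisp() {
        if (children.isEmpty()) return text;
        StringBuilder sb = new StringBuilder("( " + text);
        for (AstNode c : children) sb.append(' ').append(c.toLisp());
        return sb.append(" )").toString();
    }
}
```

The grouping that parentheses express in the source text is carried by the nesting of the nodes, so `(3+4)*5` needs no parenthesis node at all.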
30 AST Construction
- To get ANTLR to generate a useful AST
- turn on the buildAST option
- add a few suffix operators
    class ExprParser extends Parser;
    options { buildAST=true; }

    expr  : mexpr ((PLUS^ | MINUS^) mexpr)* ;
    mexpr : atom (STAR^ atom)* ;
    atom  : INT | LPAREN! expr RPAREN! ;
- No changes in the Lexer.
31 AST Operators
- AST root operator ^
- Normally AntLR makes the first token it encounters the root of the tree
- We usually want to manipulate this, e.g., for operators
- A token suffixed with the root operator ^ forces that token as the root of the current tree
    expr : mexpr ((PLUS^ | MINUS^) mexpr)* ;
- AST exclude operator !
- Tokens / rule references suffixed with the exclude operator ! are not included in the AST
- e.g., for parentheses
    atom : INT | LPAREN! expr RPAREN! ;
32 AST Parsing and Evaluation
- Rule format is like #(A B C)
- which means "match a node of type A, and then descend into its list of children and match B and C".
- This notation can be nested arbitrarily, using #(...) for child trees
- e.g., #(A B #(C D) )
    class ExprTreeParser extends TreeParser;

    expr returns [int r=0]
    { int a,b; }
        :   #(PLUS a=expr b=expr)  { r = a+b; }
        |   #(MINUS a=expr b=expr) { r = a-b; }
        |   #(STAR a=expr b=expr)  { r = a*b; }
        |   i:INT { r = (int)Integer.parseInt(i.getText()); }
        ;
- Important: sufficient matches are not exact matches. As long as the tree satisfies the pattern, a match is reported, regardless of how much is left unparsed
- e.g., #( A B ) matches #( A #(B C) D )
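The tree-walking strategy can be sketched in plain Java as a recursive method that matches the root and then descends into the children. The `TreeEval` and `Node` classes are hypothetical stand-ins; a generated tree parser works on `antlr.collections.AST` nodes and token types rather than strings.

```java
// Hypothetical tree walker over a tiny hand-built AST; NOT generated by AntLR.
class TreeEval {
    // Homogeneous node: an operator ("+", "-", "*") with two children, or an integer leaf.
    static class Node {
        final String text;
        final Node left, right;
        Node(String text) { this(text, null, null); }
        Node(String text, Node left, Node right) { this.text = text; this.left = left; this.right = right; }
    }

    // Mirrors the expr tree rule: match the root, then descend into its children.
    static int expr(Node t) {
        switch (t.text) {
            case "+": return expr(t.left) + expr(t.right);
            case "-": return expr(t.left) - expr(t.right);
            case "*": return expr(t.left) * expr(t.right);
            default:  return Integer.parseInt(t.text);   // INT leaf
        }
    }
}
```

Unlike the syntax-directed version, this walker can be run any number of times over the same tree, which is what makes multiple passes possible.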
33 in Java
- The code to launch the parser and the tree walker
    import java.io.*;
    import antlr.CommonAST;
    import antlr.collections.AST;

    class Calc {
        public static void main(String[] args) {
            try {
                CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
                CalcParser parser = new CalcParser(lexer);
                parser.expr();  // Parse the input expression
                CommonAST t = (CommonAST)parser.getAST();
                System.out.println(t.toStringList());  // Print the resulting tree out in LISP notation
                CalcTreeWalker walker = new CalcTreeWalker();
                int r = walker.expr(t);  // Traverse the tree created by the parser
                System.out.println("value is " + r);
            } catch (Exception e) {
                System.err.println("exception: " + e);
            }
        }
    }
34 AST Construction by Hand
- In some cases, you may want to transform a tree yourself, e.g., optimization of addition with zero
    class CalcTreeWalker extends TreeParser;
    options { buildAST = true; }    // "transform" mode

    expr
        :!  #(PLUS left:expr right:expr)    // '!' turns off auto transform
            {
                if ( #right.getType()==INT &&
                     Integer.parseInt(#right.getText())==0 )    // x+0 = x
                {
                    #expr = #left;
                }
                else if ( #left.getType()==INT &&
                          Integer.parseInt(#left.getText())==0 )    // 0+x = x
                {
                    #expr = #right;
                }
                else    // x+y
                {
                    #expr = #(PLUS, left, right);
                }
            }
        ...
        ;
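The same rewrite can be sketched outside AntLR as a function that returns a simplified copy of a tree. The `ZeroFold` and `Node` classes are hypothetical, and recognising zero by comparing the leaf text with "0" is a simplification of the `getType()`/`parseInt` test used in the grammar.

```java
// Hypothetical tree rewrite mirroring the x+0 / 0+x simplification; NOT AntLR output.
class ZeroFold {
    static class Node {
        final String text;
        final Node left, right;
        Node(String text) { this(text, null, null); }
        Node(String text, Node left, Node right) { this.text = text; this.left = left; this.right = right; }
        boolean isZero() { return left == null && text.equals("0"); }
        String toLisp() {
            return left == null ? text : "( " + text + " " + left.toLisp() + " " + right.toLisp() + " )";
        }
    }

    // Returns a simplified copy of the tree; children are simplified before their parent.
    static Node simplify(Node t) {
        if (t.left == null) return t;                   // leaf: nothing to do
        Node l = simplify(t.left), r = simplify(t.right);
        if (t.text.equals("+")) {
            if (r.isZero()) return l;                   // x+0 = x
            if (l.isZero()) return r;                   // 0+x = x
        }
        return new Node(t.text, l, r);                  // x+y, or any other operator
    }
}
```

Simplifying the children first lets nested cases collapse in one pass: `( + ( + 0 7 ) 0 )` becomes `7`, while `( * 0 3 )` is left alone because only addition is rewritten.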
35 in Java
- The code to launch the parser and tree transformer is
    import java.io.*;
    import antlr.CommonAST;
    import antlr.collections.AST;

    class Calc {
        public static void main(String[] args) {
            try {
                CalcLexer lexer = new CalcLexer(new DataInputStream(System.in));
                CalcParser parser = new CalcParser(lexer);
                parser.expr();  // Parse the input expression
                CommonAST t = (CommonAST)parser.getAST();
                System.out.println(t.toLispString());  // Print the resulting tree out in LISP notation
                CalcTreeWalker walker = new CalcTreeWalker();
                walker.expr(t);  // Traverse the tree created by the parser
                t = (CommonAST)walker.getAST();  // Get the result tree from the walker
                System.out.println(t.toLispString());
            } catch (Exception e) {
                System.err.println("exception: " + e);
            }
        }
    }
36 Left Recursion Solved
- E → E + T | T, written in AntLR as expr : expr PLUS term | term ;
- The generated code would call expr() infinitely
    expr() {
        expr();
        match(PLUS);
        term();
    }
- Eliminate left recursion by
- E → T E'
- E' → + T E' | ε
- results in
    expr : term (PLUS term)* ;
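The rewritten rule maps onto a loop rather than a recursive call, and folding inside the loop keeps addition left-associative. The sketch below assumes well-formed input made of single-digit terms separated by '+'; the class `LeftAssoc` and its string-based tree output are hypothetical.

```java
// Hypothetical sketch: expr : term (PLUS term)* as a loop; terms are single digits
// and the input is assumed to be well formed (e.g. "1+2+3").
class LeftAssoc {
    // Parses the input and returns the resulting tree in LISP-style notation.
    static String parse(String src) {
        int pos = 0;
        String tree = String.valueOf(src.charAt(pos++));        // first term
        while (pos < src.length() && src.charAt(pos) == '+') {
            pos++;                                              // match(PLUS)
            String term = String.valueOf(src.charAt(pos++));    // next term
            tree = "( + " + tree + " " + term + " )";           // fold to the left
        }
        return tree;
    }
}
```

The loop always consumes a term before it can repeat, so it terminates, and the accumulated tree has the same shape the left-recursive grammar described.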
37 Links
- AntLR Reference Manual by Terence Parr
- antlr.org/share/1084743321127/ANTLR_Reference_Manual.pdf
- AntLR Tutorial by Ashley J.S Mills
- http://supportweb.cs.bham.ac.uk/docs/tutorials/docsystem/build/tutorials/antlr/antlrhome.html
- An Introduction to AntLR by Terence Parr
- http://www.cs.usfca.edu/parrt/course/652/lectures/antlr.html
- An AntLR Tutorial by Scott Stanchfield
- javadude.com/articles/antlrtut/