ANTLR v3 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: ANTLR v3


1
ANTLR v3 LL(*) Parsing
  • Terence Parr
  • University of San Francisco
  • @
  • Coverity
  • July 2006

2
Topics
  • Research goals and motivation
  • LL(k) parsing background
  • LL(*) solution
  • How it works
  • Auto-backtracking extension
  • Some initial results

3
Research goals
  • Make top-down LL-based parsers as powerful as
    possible
  • allows more natural grammars
  • makes language tools more accessible
  • My research is constrained by what most programmers
    can/will use
  • recursive-descent parsers must be the base
  • k>1 fixed lookahead
  • semantic predicates
  • syntactic predicates: controlled backtracking and a
    means of specifying ambiguity resolution
  • And for my next trick: LL(*)

4
Recent advances
  • GLR (Tomita): handles CFGs like Earley but much
    more efficiently; uses LR(1), forking new
    parsers at nondeterministic states
  • Elkhound (McPeak): reduces forking further (even
    in nondeterministic situations)
  • PEG (parsing expression grammar) (Ford): formalizes
    ordered productions and syntactic predicates
    from PCCTS/ANTLR; backtracks through
    alternatives taking the first match; no strict
    ordering with GLR
  • Packrat parsing (Ford; see Rats! by Grimm):
    memoizes partial parsing results to guarantee
    linear time, but with a biggish heap
  • Dramatic foreshadowing: LL(*) is to packrat as
    Elkhound GLR is to Earley's algorithm

5
LL(*) Motivation
  • Natural grammars are sometimes not LL(k), e.g.
    abstract vs concrete methods
  • From the left edge, lookahead is unbounded to see
    the ';' vs '{'. We need arbitrary lookahead
    because of the arg*
  • If you have actions after ID, you can't easily
    refactor
  • Lookahead will usually be 5<k<10 for this decision

method : type ID '(' arg* ')' ';'
       | type ID '(' arg* ')' body
       ;
6
Another non-LL(k) grammar
  • Can't see past modifier here
  • Could left-factor, but that is not always possible and
    it's unnatural!

def : modifier* classDef
    | modifier* interfaceDef
    ;

// left-factored
def : modifiers (classDef|interfaceDef) ;
7
Background LL parsers
  • Building a parser generator is easy except for
    the lookahead analysis
  • rule ref → rule()
  • token ref → match(token)
  • rule def → void rule() { if (
    lookahead-expr-alt 1 ) match alt 1; else if
    ( lookahead-expr-alt 2 ) match alt 2; else
    error; }
  • The nature of the lookahead expressions dictates
    the strength of your parser generator

8
LL(2) parser example
void stat() {
    if ( LA(1)==ID && LA(2)==EQUALS ) {
        match(ID); match(EQUALS); expr();
    }
    else if ( LA(1)==ID && LA(2)==COLON ) {
        match(ID); match(COLON); stat();
    }
    else error();
}

stat : ID '=' expr
     | ID ':' stat
     ;

Lookahead is the set of 2-sequences that indicate
which alternative will ultimately succeed
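
The LL(2) decision above can be tried out with a small self-contained recognizer. This is only a sketch: the token enum, the list-based token buffer, the `parses` helper, and the one-token `expr` rule are assumptions made for illustration and are not part of the slides.

```java
import java.util.List;

class LL2Parser {
    enum Tok { ID, EQUALS, COLON, INT, EOF }

    private final List<Tok> input;
    private int p = 0; // index of the current token

    LL2Parser(List<Tok> input) { this.input = input; }

    // LA(i): type of the i-th lookahead token (1-based); EOF past the end.
    Tok LA(int i) {
        int idx = p + i - 1;
        return idx < input.size() ? input.get(idx) : Tok.EOF;
    }

    void match(Tok t) {
        if (LA(1) != t) {
            throw new IllegalStateException("expected " + t + " but found " + LA(1));
        }
        p++;
    }

    // stat : ID '=' expr | ID ':' stat ;
    void stat() {
        if (LA(1) == Tok.ID && LA(2) == Tok.EQUALS) {
            match(Tok.ID); match(Tok.EQUALS); expr();
        } else if (LA(1) == Tok.ID && LA(2) == Tok.COLON) {
            match(Tok.ID); match(Tok.COLON); stat();
        } else {
            throw new IllegalStateException("no viable alternative at token " + p);
        }
    }

    // expr : INT ;  (placeholder rule, assumed for this sketch)
    void expr() { match(Tok.INT); }

    // Hypothetical helper: true if the whole token list is one stat.
    static boolean parses(List<Tok> tokens) {
        LL2Parser parser = new LL2Parser(tokens);
        try {
            parser.stat();
            parser.match(Tok.EOF);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Both alternatives start with ID, so one token of lookahead cannot separate them; the second token (EQUALS or COLON) does.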
9
Lookahead as DFA
void a() {
    int alt = 0;
    if ( LA(1)==ID ) {
        if ( LA(2)==EQUALS ) alt = 1;
        if ( LA(2)==COLON ) alt = 2;
    }
    switch (alt) {
        case 1 : match(ID); match(EQUALS); expr(); break;
        case 2 : match(ID); match(COLON); stat(); break;
        default : error();
    }
}
10
Solution overview
  • Natural extension to LL(k) lookahead DFA: allow
    cyclic DFA that can skip ahead past the modifiers
    to the class or interface def
  • Don't approximate the entire CFG with a regex, i.e.,
    don't include the class or interface def rules
  • Just predict and proceed normally with the LL parse
  • DFA yields predicted alt number
  • Grammar actions are not sucked into DFAs and
    aren't executed during prediction
  • No need to specify k a priori

11
LL(*) code
  • Arbitrary cyclic graphs can't be encoded w/o
    gotos in Java, but here a simple while loop is ok

void a() {
    int alt = 0;
    while ( LA(1) in modifier ) consume();
    if ( LA(1)==CLASS ) alt = 1;
    if ( LA(1)==INTERFACE ) alt = 2;
    switch (alt) {
        case 1 : ...
        case 2 : ...
        default : error();
    }
}
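
A runnable variant of this prediction loop might look like the sketch below. It assumes a token list instead of a real token stream, and the token names PUBLIC, STATIC, and FINAL are invented stand-ins for whatever the modifier rule matches. Unlike the slide's shorthand, the scan reads ahead by index and never consumes input, so the parser can proceed normally once the alternative is known.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class CyclicDfaPredictor {
    enum Tok { PUBLIC, STATIC, FINAL, CLASS, INTERFACE, ID, EOF }

    // Token types accepted by the (assumed) modifier rule.
    static final Set<Tok> MODIFIERS = EnumSet.of(Tok.PUBLIC, Tok.STATIC, Tok.FINAL);

    // Predicts an alternative of
    //   def : modifier* classDef | modifier* interfaceDef ;
    // Returns 1 or 2, or 0 if neither alternative is viable.
    // Only reads tokens from index `start`; nothing is consumed.
    static int predictDef(List<Tok> tokens, int start) {
        int i = start;
        // The start state loops on any modifier: this is the DFA cycle.
        while (i < tokens.size() && MODIFIERS.contains(tokens.get(i))) {
            i++;
        }
        Tok t = i < tokens.size() ? tokens.get(i) : Tok.EOF;
        if (t == Tok.CLASS) return 1;      // accept state predicting alt 1
        if (t == Tok.INTERFACE) return 2;  // accept state predicting alt 2
        return 0;                          // no viable alternative
    }
}
```

The loop stops at the first non-modifier token, so the amount of lookahead used depends on the input rather than on a fixed k.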
12
Isn't that just backtracking?
  • No. For example, if I can guarantee you will
    never look ahead more than 10 symbols, it's just
    LL(10), right?
  • It is not backtracking with the parser. The DFA is smaller
    and faster, e.g., a DFA predicting expr does not
    follow the deep call chain the parser does
  • Don't have to avoid or unroll arbitrary user
    actions in the grammar!
  • The DFAs are efficiently coded and automatically
    throttle down when less lookahead is needed

13
LL(*) DFA Construction Algorithm
14
Algorithm discussion
  • need suitable grammar representation
  • sample LL(2) lookahead set computation
  • LL(*) algorithm outline
  • sample LL(*) DFA construction

15
Lookahead NFA Construction
a : b X | b Y ;
b : B | ;
[Figure: NFA for rules a and b]
16
Sample LL(2) Lookahead Set
  • Now, consider a simple fixed k lookahead
    computation algorithm
  • Depth-first walk of NFA at left edge of each alt,
    popping state off stack upon end-of-rule.
    Terminate paths after traversing the k=2nd
    non-epsilon edge
  • Lookahead for rule a alt 1, state sequences:
    4, 2, 21, 16 B 17, 20, 3, pop, 5 X 7
    22, 18, 19, 20, 3, pop, 5 X 7, 13, 1,
  • Yields {BX, X}
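
The fixed-k computation can be sketched without building an NFA by expanding rule references depth-first and stopping after k tokens. The sketch below is not the slide's NFA walk; it assumes the grammar behind the example is a : b X | b Y ; b : B | ; (read off the garbled slide text), that lowercase names are rules and uppercase names are tokens, and that the grammar has no left recursion. The class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FixedKLookahead {
    // Rule name -> list of alternatives, each a list of symbols.
    static final Map<String, List<List<String>>> RULES = new HashMap<>();
    static {
        // a : b X | b Y ;   (assumed grammar)
        RULES.put("a", Arrays.asList(Arrays.asList("b", "X"), Arrays.asList("b", "Y")));
        // b : B | ;         (second alternative is empty)
        RULES.put("b", Arrays.asList(Arrays.asList("B"), new ArrayList<String>()));
    }

    // Token sequences of length <= k that can start a derivation of `symbols`.
    // Sequences shorter than k mean the symbols ran out first.
    static Set<String> lookahead(List<String> symbols, int k) {
        Set<String> result = new TreeSet<>();
        if (k == 0 || symbols.isEmpty()) {
            result.add("");
            return result;
        }
        String head = symbols.get(0);
        List<String> rest = symbols.subList(1, symbols.size());
        if (RULES.containsKey(head)) {
            // Rule reference: substitute each alternative and keep walking.
            for (List<String> alt : RULES.get(head)) {
                List<String> expanded = new ArrayList<>(alt);
                expanded.addAll(rest);
                result.addAll(lookahead(expanded, k));
            }
        } else {
            // Token: one non-epsilon edge traversed, k - 1 remain.
            for (String tail : lookahead(rest, k - 1)) {
                result.add(tail.isEmpty() ? head : head + " " + tail);
            }
        }
        return result;
    }
}
```

Calling `FixedKLookahead.lookahead(Arrays.asList("b", "X"), 2)` gives the set containing "B X" and "X", which matches the {BX, X} on the slide for alt 1 of rule a.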

17
LL(*) Algorithm Outline
  • idea: perform a breadth-first search of the NFA,
    carrying the stack context along with it so it
    knows where to return upon reaching an end-of-rule NFA state
  • modify the classical NFA-to-DFA conversion (subset
    construction algorithm)
  • a DFA state encodes the configurations the NFA could be in
    after having seen the input sequence, including the call
    invocation stack
  • NFA configuration (s|alt|context) tracks state,
    predicted alt, and rule invocation stack to get
    to that state
  • terminate the algorithm when a state uniquely predicts
    an alternative or a nondeterminism is found: (s|i|ctx)
    and (s|j|ctx) for same state s but different alts
    i,j and same/similar context
  • verify DFA is reduced and all alternatives have a
    predict state
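
The two termination tests in this outline can be sketched over a set of (s|alt|context) configurations. This is a simplified illustration, not ANTLR's implementation: the class and method names are invented, alternatives are assumed to be numbered from 1, and contexts are compared for exact equality, whereas the slide allows "same/similar" contexts.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DfaStateCheck {
    // One NFA configuration (s|alt|context).
    static final class Config {
        final int state;              // NFA state number s
        final int alt;                // alternative this path predicts (>= 1)
        final List<Integer> context;  // rule invocation stack of return states

        Config(int state, int alt, List<Integer> context) {
            this.state = state;
            this.alt = alt;
            this.context = new ArrayList<>(context);
        }
    }

    // Returns the alt if every configuration predicts the same one, else 0.
    static int uniqueAlt(List<Config> configs) {
        int alt = 0;
        for (Config c : configs) {
            if (alt == 0) alt = c.alt;
            else if (alt != c.alt) return 0;
        }
        return alt;
    }

    // Alts involved in a nondeterminism: same NFA state and same context
    // reached while predicting different alternatives.
    static Set<Integer> conflictingAlts(List<Config> configs) {
        Map<String, Set<Integer>> altsByKey = new HashMap<>();
        for (Config c : configs) {
            String key = c.state + "|" + c.context;
            altsByKey.computeIfAbsent(key, k -> new TreeSet<>()).add(c.alt);
        }
        Set<Integer> conflicts = new TreeSet<>();
        for (Set<Integer> alts : altsByKey.values()) {
            if (alts.size() > 1) conflicts.addAll(alts);
        }
        return conflicts;
    }
}
```

A DFA state whose configurations share a state but differ in context is not a conflict under this test; more lookahead may still separate the alternatives, which is why the construction keeps going in that case.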

18
LL(*) DFA Conversion
a : b X | b Y ; b : B | ;
[Figure: classic DFA and LL(*) DFA for this grammar, with callouts
"1 alt" and "same NFA state, diff context"]
19
Successful termination
a : A X R | A Y S ;
a : (A|A) B ;
[Figures: DFA and LL(*) result for each grammar]
Stops at ambiguity or unique prediction
20
Can't see past recursion
  • LL(*) DFA construction takes the LL stack into
    consideration, but the resulting DFA will not have a
    stack; it uses a sequence of states instead
  • Example weakness (same language, diff grammar)

// works
a : b X | b Y ;
b : A+ ;

// doesn't work
a : b X | b Y ;
b : A | A b ; // tail recursion

t.g:2:5: Alternative 1: after matching input such
as A A A A decision cannot predict what comes
next due to recursion overflow to b from b
t.g:2:5: Alternative 2
21
LL(*) analysis fails sometimes
  • The LL(*) algorithm is exponential in the worst case,
    like the subset construction algorithm
  • keeps looking for more lookahead to distinguish
    alternatives: a problem in big grammars
  • doesn't like common recursive prefixes
  • w/o a failsafe it would not terminate in our lifetime
  • Workarounds
  • manually set fixed k lookahead if possible
  • syntactic predicates
  • auto-backtracking mode
  • refactor the grammar if ambiguous or to reduce
    lookahead requirements

22
Auto-Backtracking
  • Idea: when LL(*) analysis fails, simply backtrack
    at runtime to figure it out
  • newbie or rapid prototyping mode
  • people dump the craziest stuff into ANTLR
  • impl: add a syntactic predicate to each alt's left
    edge
  • LL(*) alg. uses preds only in nondeterministic
    states (NFA config. extended to include semantic
    context)
  • Use fixed k lookahead + backtracking to get the grammar
    working, then optimize with LL(*)
  • ANTLR v3 can memoize parsing results to guarantee
    linear parsing time
  • Demo: Java parsing with and w/o memoization
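
The effect of memoization on a backtracking decision can be seen in a toy recognizer. Everything here is an assumption made for illustration: the grammar s : e '.' | e ';' ; e : '(' e ')' | DIGIT ; is invented, the input is a plain string, and the memo table is a simple map from start index to stop index. It is a sketch of the packrat-style idea, not ANTLR v3's generated code.

```java
import java.util.HashMap;
import java.util.Map;

class MemoBacktrack {
    static final int FAILED = -1;

    private final String input;
    private final boolean memoize;
    // For rule e: start index -> index just past the match, or FAILED.
    private final Map<Integer, Integer> eMemo = new HashMap<>();
    int eBodyRuns = 0; // how often the body of rule e actually executed

    MemoBacktrack(String input, boolean memoize) {
        this.input = input;
        this.memoize = memoize;
    }

    // e : '(' e ')' | DIGIT ;  returns the stop index or FAILED.
    int e(int p) {
        if (memoize && eMemo.containsKey(p)) {
            return eMemo.get(p); // reuse the earlier result at this position
        }
        eBodyRuns++;
        int stop = FAILED;
        if (p < input.length() && input.charAt(p) == '(') {
            int q = e(p + 1);
            if (q != FAILED && q < input.length() && input.charAt(q) == ')') {
                stop = q + 1;
            }
        } else if (p < input.length() && Character.isDigit(input.charAt(p))) {
            stop = p + 1;
        }
        if (memoize) eMemo.put(p, stop);
        return stop;
    }

    // s : e '.' | e ';' ;  try alt 1 first; on failure rewind and try alt 2.
    boolean s() {
        return alt(0, '.') || alt(0, ';');
    }

    // Speculatively match e followed by a terminator that ends the input.
    private boolean alt(int start, char terminator) {
        int q = e(start);
        return q != FAILED && q + 1 == input.length() && input.charAt(q) == terminator;
    }
}
```

On the input "((7));" the first alternative fails at the final character, so rule e is re-entered at position 0. Without memoization the body of e runs 6 times; with memoization it runs 3 times, because the second attempt is answered from the table. The cost is the heap held by that table.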

23
LL(*) + Auto-Backtracking
grammar r;
options {backtrack=true;}
s : e '' ;
e : '' e | '(' e ')' | INT ;
24
Java 1.4 Grammar Results
  • Tweaked version of Rats!'s Java grammar
  • 99 rules, 86 decisions
  • LL(1) decisions: 68 (excluding 2 that backtrack)
  • LL(2) decisions: 12
  • LL(*) decisions: 4
  • Backtracking decisions: 2
  • No heap wasted on memoization (memo. off)!
  • If limited to k=1, 10 decisions backtrack
  • Prelim. parsing profile on java/awt/Container.java:
    LL(*) lookahead range 1..8 tokens, average 2;
    backtracking range 1..8, average 3.5

25
Can we classify LL(*) strength?
  • No strict ordering with CFG (ala GLR). Grammar for
    context-sensitive A^n B^n C^n?

s : (a)=> A* b EOF
  | A* b EOF
  ;
a : A a B | ;   // matches A^n B^n
b : B b C | ;   // matches B^n C^n

  • second production forces a decision
  • else predicate (a) not used
  • language unaffected

Adapted from Ford's PEG paper
26
LL(*) vs LR(k)
  • LR(k), even with k=1, is generally more powerful
    than LL(*), or at least more efficient for the same
    grammar, but there is no strict ordering: add
    epsilon rule refs to the left edge of our grammar and
    it's not LR(k) for fixed k (derived from adding
    actions)

a : b A+ X R
  | c A+ Y S
  ;
b : ;
c : ;

LL(*) but not LR(k) due to reduce-reduce conflict
27
Summary and Conclusions
  • Brazen assertion: LL(*) + syntactic predicates is
    the most powerful parsing strategy that is
    accessible/attractive to the average programmer
  • LL(*) has the benefits, flexibility, and simplicity of LL
    but is much stronger; supports natural grammars
  • Doesn't alter the recursive-descent parser itself at
    all; just enhances the predictive capabilities
  • Unifies lexing, parsing, tree parsing
  • Basic algorithm is not that complicated, but
    making it real and useful is interesting
  • Beta release: http://www.antlr.org/v3
  • BSD license