Parsing 1 of 2

Provided by: mcsV
1
Parsing 1 of 2
  • COMP431 (Compilers) by Dr Alex Potanin

2
Reminder: Front End
[Diagram: source code -> scanner -> tokens -> parser -> IR, with errors reported as a separate output]
  • Recognise legal procedure
  • Report errors, produce IR
  • Much of front end construction can be automated,
    which is exactly what we will do with ANTLR
  • Scanner performs lexical analysis by breaking the
    input into individual words or tokens. Parser
    performs syntax analysis by parsing the phrase
    structure of the program.

3
Reminder: Front End (Scanner)
[Diagram: source code -> scanner -> tokens -> parser -> IR, with errors reported as a separate output]
  • Maps characters into tokens, the basic unit of
    syntax
  • x = x + y becomes <id,x> = <id,x> + <id,y>
  • Typical tokens: number, id, +, -, *, /, do, end
  • Eliminates white space (tabs, blanks, comments)
  • A key issue is speed, so use a specialised
    recogniser rather than an automatically generated
    one
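The mapping from characters to tokens can be sketched in a few lines of Python. This is only an illustration, not the course's scanner: the token kind `op`, the helper name `scan`, and the decision to treat `=` as an operator are assumptions made for the example, and keywords such as do and end are left out. It uses the standard `re` module, which is the convenient route rather than the fast hand-written recogniser the slide recommends.

```python
import re

# Hypothetical token kinds; the slide only names number and id explicitly.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id",     r"[A-Za-z][A-Za-z0-9]*"),
    ("op",     r"[=+\-*/]"),
    ("ws",     r"[ \t]+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def scan(source):
    """Return a list of (kind, lexeme) pairs, dropping white space."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "ws":          # white space is eliminated here
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(scan("x = x + y"))
```

Calling `scan("x = x + y")` yields three `id` tokens separated by two `op` tokens, which matches the slide's example.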

4
Specifying Patterns
  • A scanner must recognise the units of syntax
  • Some parts are easy
  • White space
  • <ws> ::= <ws> ' '
  •        | <ws> '\t'
  •        | ' ' | '\t'
  • Keywords and operators
  • Specified as literal patterns: do, end
  • Comments
  • Opening and closing delimiters: /* ... */

5
Specifying Patterns
  • A scanner must recognise the units of syntax
  • Other parts are much harder
  • Identifiers
  • Alphabetic followed by k alphanumerics (plus
    symbols such as _)
  • Numbers
  • Integers: 0, or a digit from 1-9 followed by digits
    from 0-9
  • Decimals: integer '.' digits from 0-9
  • Reals: (integer or decimal) 'E' (+ or -) digits
    from 0-9
  • Complex: '(' real ',' real ')'
  • We need a powerful notation to specify these
    patterns

6
Regular Expressions
  • Patterns are often specified as regular languages
    described by regular expressions
  • Regular expressions (over an alphabet Σ)
  • ε is an RE denoting the set {ε}
  • if a ∈ Σ, then a is an RE denoting {a}
  • if r and s are REs, denoting L(r) and L(s),
    then
  • (r) is an RE denoting L(r)
  • (r) | (s) is an RE denoting L(r) ∪ L(s) -
    alternation
  • (r)(s) is an RE denoting L(r)L(s) - concatenation
  • (r)* is an RE denoting L(r)* - closure
  • If we adopt a precedence for operators, the extra
    parentheses can go away. We assume closure, then
    concatenation, then alternation as the order of
    precedence.

7
Examples
  • Identifier
  • letter -> (a | b | c | ... | z | A | B | C | ... | Z)
  • digit -> (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  • id -> letter (letter | digit)*
  • Numbers
  • integer -> (+ | - | ε)(0 | (1 | 2 | 3 | ... | 9) digit*)
  • decimal -> integer . ( digit )*
  • real -> ( integer | decimal ) E (+ | -) digit*
  • complex -> \( real , real \) - here \( and \) are
    literal parenthesis characters
  • Numbers can get much more complicated
  • Most programming language tokens can be described
    with REs
  • We can use REs to build scanners automatically

8
Recognisers
  • From a regular expression we can construct a
    Deterministic Finite Automaton (DFA)
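  • Recogniser for identifier ( id -> letter (letter
    | digit)* )

[Diagram: DFA with states 0-3. State 0 moves to state 1 on a letter and to state 3 (error) on a digit or other character. State 1 loops on a letter or digit and moves to state 2 (accept) on any other character.]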
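A table-driven version of this recogniser might look like the sketch below. The state numbers follow the slide (0 start, 1 inside an identifier, 2 accept, 3 error); the names `char_class`, `TRANSITIONS` and `scan_identifier` are made up for the example, and treating end of input as an "other" character is an assumption.

```python
def char_class(ch):
    """Map a character to one of the three input classes of the DFA."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


# States: 0 = start, 1 = inside identifier, 2 = accept, 3 = error
TRANSITIONS = {
    (0, "letter"): 1, (0, "digit"): 3, (0, "other"): 3,
    (1, "letter"): 1, (1, "digit"): 1, (1, "other"): 2,
}


def scan_identifier(text, start=0):
    """Return the identifier beginning at text[start], or None if there is none."""
    state, pos = 0, start
    while state in (0, 1):
        # Assumption: running off the end of the input counts as "other".
        cls = char_class(text[pos]) if pos < len(text) else "other"
        state = TRANSITIONS[(state, cls)]
        if state == 1:
            pos += 1                     # consume the letter or digit
    return text[start:pos] if state == 2 else None


print(scan_identifier("x1 = 3"))
print(scan_identifier("9lives"))
```

The character that triggers the move to the accept state is not consumed, so the scanner can resume from it for the next token.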
9
Lecture 3
  • Begins here
  • Clarification on strength reduction
  • for (...) x = x + (a * b): pull (a * b) out of the loop
  • for (...; i < coll.size(); ...) coll.get(i)
  • - can we pull coll.size() out?
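The two cases can be contrasted with a small Python sketch. Moving an unchanging computation out of a loop is usually called loop-invariant code motion. The loop body `x = x + (a * b)` is an assumed shape for the first example, and the function names are invented. The second pair shows why hoisting the size is only safe when the loop body cannot change the collection: here the body removes elements, so the hoisted bound becomes stale.

```python
def sum_with_hoisting(n, a, b):
    """x = x + (a * b) repeated n times, with (a * b) computed once."""
    ab = a * b                 # invariant: a and b never change in the loop
    x = 0
    for _ in range(n):
        x = x + ab
    return x


def visit_rechecking_size(coll):
    """Re-evaluates len(coll) on every iteration, like i < coll.size()."""
    seen = []
    i = 0
    while i < len(coll):
        seen.append(coll[i])
        if coll[i] % 2 == 0:
            coll.pop()         # the body shrinks the collection
        i += 1
    return seen


def visit_hoisted_size(coll):
    """Same loop with the size hoisted; wrong when the body mutates coll."""
    seen = []
    size = len(coll)           # stale as soon as coll.pop() runs
    i = 0
    while i < size:
        seen.append(coll[i])   # raises IndexError once coll has shrunk
        if coll[i] % 2 == 0:
            coll.pop()
        i += 1
    return seen


print(sum_with_hoisting(3, 2, 5))
print(visit_rechecking_size([2, 4, 6, 8]))
```

A compiler can therefore hoist `coll.size()` only if it can show that nothing in the loop changes the collection's size.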

10
Automatic Construction
  • Scanner generators automatically construct code
    from RE-like descriptions
  • construct a DFA
  • use state minimisation techniques
  • emit code for the scanner (table driven or direct
    code)
  • A key issue in automation is an interface to the
    parser
  • lex is a scanner generator supplied with UNIX
  • emits C code for scanner
  • provides macro definitions for each token (used
    in the parser)
  • ANTLR combines the scanner and the parser
    together
  • Others include JavaCC, SableCC, YACC, etc.

11
Grammars for Regular Languages
  • For any regular expression there exists a grammar
    that describes the same language. (provable fact)
  • Regular grammars have productions in one of two
    forms (A is any non-terminal and a is any
    terminal symbol)
  • A -> aA
  • A -> a

12
More Regular Expressions
  • What about the RE (a | b)* abb ?
  • State 0 has multiple transitions on a!
  • Called Nondeterministic Finite Automaton (NFA)
  • DFAs are clearly a subset of NFAs
  • Any NFA can be converted into a DFA, by
    simulating sets of simultaneous states
  • But, possible exponential blowup

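[Diagram: NFA with states 0-3. State 0 loops on a | b and moves to state 1 on a; state 1 moves to state 2 on b; state 2 moves to state 3 (accepting) on b.]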
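Simulating sets of simultaneous states can be sketched as follows for the NFA of (a | b)* abb. The dictionary layout and function names are assumptions for the example. `reachable_dfa_states` enumerates the state sets that the subset construction would turn into DFA states for this particular NFA.

```python
# NFA for (a | b)* abb: state 0 loops on a and b, then 0 -a-> 1 -b-> 2 -b-> 3.
NFA = {
    (0, "a"): {0, 1},          # the nondeterministic choice on a
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}


def step(states, ch):
    """All NFA states reachable from any state in `states` on input ch."""
    return set().union(*(NFA.get((s, ch), set()) for s in states))


def nfa_accepts(text):
    current = {START}
    for ch in text:
        current = step(current, ch)
    return bool(current & ACCEPTING)


def reachable_dfa_states(alphabet="ab"):
    """Subset construction: each reachable set of NFA states is one DFA state."""
    start = frozenset({START})
    seen, todo = {start}, [start]
    while todo:
        current = todo.pop()
        for ch in alphabet:
            nxt = frozenset(step(current, ch))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen


print(nfa_accepts("aababb"), nfa_accepts("abba"))
print(sorted(sorted(s) for s in reachable_dfa_states()))
```

For this NFA only four state sets are reachable, so no blowup occurs here; the exponential case is a worst-case bound, not the typical outcome.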
13
Limits of Regular Languages
  • Not all languages are regular
  • One cannot construct DFAs to recognise these
    languages
  • L = { p^k q^k }
  • L = { w c w^r | w ∈ Σ* }
  • NB! Neither of these is a regular language!
    (DFAs cannot count)
  • But, this is a little subtle. One can construct
    DFAs for
  • Alternating 0s and 1s: (ε | 1)(01)*(ε | 0)
  • Sets of pairs of 0s and 1s: (01 | 10)+
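The difference shows up when the recognisers are written out: the two non-regular languages need unbounded memory (a counter or a stack), while alternating 0s and 1s needs none. In this sketch the function names are invented, k >= 0 is assumed for the first language, and w is assumed not to contain the marker c.

```python
import re


def is_pk_qk(text):
    """Recognise p^k q^k (k >= 0 assumed) with an unbounded counter."""
    count = 0
    i = 0
    while i < len(text) and text[i] == "p":
        count += 1
        i += 1
    while i < len(text) and text[i] == "q":
        count -= 1
        i += 1
    return i == len(text) and count == 0


def is_w_c_wr(text):
    """Recognise w c w^r with a stack, assuming w never contains 'c'."""
    stack = []
    chars = iter(text)
    for ch in chars:
        if ch == "c":
            break
        stack.append(ch)
    else:
        return False                       # no centre marker found
    for ch in chars:                       # continue after the marker
        if not stack or stack.pop() != ch:
            return False
    return not stack


# Alternating 0s and 1s is regular, so a plain regular expression suffices.
ALTERNATING = re.compile(r"1?(?:01)*0?")

print(is_pk_qk("ppqq"), is_pk_qk("ppq"))
print(is_w_c_wr("abcba"), is_w_c_wr("abcab"))
print(ALTERNATING.fullmatch("1010") is not None)
```

The counter and the stack can grow without limit, which is exactly what a fixed, finite set of DFA states cannot model.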

14
So what is hard?
  • Language features that can cause problems
  • Reserved words
  • PL/I had no reserved words
  • if then then then = else; else else = then
  • Significant blanks
  • FORTRAN and Algol68 ignore blanks
  • String constants
  • Finite closures (e.g. a^k rather than a*)
  • Some languages limit identifier lengths and add
    states to count length (FORTRAN 66 had 6
    character limit)
  • These can be swept under the rug in the language
    design

15
Reminder: Front End (Parser)
[Diagram: source code -> scanner -> tokens -> parser -> IR, with errors reported as a separate output]
  • Recognise context-free syntax
  • Guide context-sensitive analysis
  • Construct IR(s)
  • Produce meaningful error messages
  • Attempt error correction
  • Parser generators mechanise much of the work
    (e.g. ANTLR)

16
Front End (Parser)
  • Context-free syntax is specified with a grammar
  • <sheep noise> ::= baa
  •               | baa <sheep noise>
  • (The noises sheep make under normal
    circumstances)
  • This format is called Backus-Naur form (BNF)
  • Formally, a grammar G = (S, N, T, P) where
  • S is the start symbol
  • N is a set of non-terminal symbols
  • T is a set of terminal symbols
  • P is a set of productions or rewrite rules
  • P : N -> (N ∪ T)*

17
Front End (Parser)
  • Context-free syntax example
  • 1 <goal> ::= <expr>
  • 2 <expr> ::= <expr> <op> <term>
  • 3          | <term>
  • 4 <term> ::= number
  • 5          | id
  • 6 <op>   ::= +
  • 7          | -
  • Simple expressions with addition and subtraction
    over tokens id and number
  • S = <goal>
  • T = { number, id, +, - }
  • N = { <goal>, <expr>, <term>, <op> }
  • P = { 1, 2, 3, 4, 5, 6, 7 }

18
Front End (Parser)
  • Given a grammar, valid sentences can be derived
    by repeated substitution.
  • Prodn   Result
  •         <goal>
  • 1       <expr>
  • 2       <expr> <op> <term>
  • 5       <expr> <op> y
  • 7       <expr> - y
  • 2       <expr> <op> <term> - y
  • 4       <expr> <op> 2 - y
  • 6       <expr> + 2 - y
  • 3       <term> + 2 - y
  • 5       x + 2 - y
  • To recognise a valid sentence in some CFG, we
    reverse this process and build up a parse
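The derivation above can be replayed mechanically. The sketch below stores the seven productions of the expression grammar in a dictionary and applies a sequence of production numbers, always rewriting the rightmost non-terminal, which is the order the table uses. The names `PRODUCTIONS` and `derive_rightmost` are invented for the example, and lexemes are abstracted to their token kinds, so x and y appear as `id` and 2 as `number`.

```python
PRODUCTIONS = {
    1: ("goal", ["expr"]),
    2: ("expr", ["expr", "op", "term"]),
    3: ("expr", ["term"]),
    4: ("term", ["number"]),
    5: ("term", ["id"]),
    6: ("op", ["+"]),
    7: ("op", ["-"]),
}
NONTERMINALS = {"goal", "expr", "term", "op"}


def derive_rightmost(sequence, start="goal"):
    """Apply each numbered production to the rightmost non-terminal.

    Returns every sentential form, beginning with the start symbol.
    """
    form = [start]
    steps = [list(form)]
    for number in sequence:
        lhs, rhs = PRODUCTIONS[number]
        idx = max(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[idx] != lhs:
            raise ValueError(f"production {number} does not apply to {form[idx]}")
        form[idx:idx + 1] = rhs          # substitute the right-hand side
        steps.append(list(form))
    return steps


for step in derive_rightmost([1, 2, 5, 7, 2, 4, 6, 3, 5]):
    print(" ".join(step))
```

The last form printed is `id + number - id`, the token-level view of x + 2 - y.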

19
Front End (Parser)
  • A parse can be represented by a parse tree or
    syntax tree. Obviously, this contains a lot of
    unnecessary information.

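[Diagram: parse tree for x + 2 - y. goal -> expr; that expr -> expr op term with op = - and term = <id,y>; the inner expr -> expr op term with op = + and term = <num,2>; the innermost expr -> term = <id,x>.]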
20
Front End (Parser)
  • So, compilers often use an abstract syntax tree
  • This is much more concise
  • Abstract syntax trees (ASTs) are often used as
    an IR between front end and back end (e.g. inside
    javac)

[Diagram: AST for x + 2 - y. The root is -, with children + and <id,y>; the + node has children <id,x> and <num,2>.]
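One possible in-memory shape for this AST is sketched below. The class names `Id`, `Num` and `BinOp` and the helpers are assumptions for illustration, not how javac or ANTLR represent trees.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Id:
    name: str


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"


Node = Union[Id, Num, BinOp]

# AST for x + 2 - y: the root is "-", and its left child is the "+" node.
AST = BinOp("-", BinOp("+", Id("x"), Num(2)), Id("y"))


def evaluate(node, env):
    """Evaluate the tree given variable values in env."""
    if isinstance(node, Num):
        return node.value
    if isinstance(node, Id):
        return env[node.name]
    left, right = evaluate(node.left, env), evaluate(node.right, env)
    return left + right if node.op == "+" else left - right


def count_nodes(node):
    if isinstance(node, BinOp):
        return 1 + count_nodes(node.left) + count_nodes(node.right)
    return 1


print(evaluate(AST, {"x": 5, "y": 3}), count_nodes(AST))
```

The tree has five nodes, one per operator and operand, with none of the goal, expr, term and op interior nodes of the parse tree.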
21
Derivations
  • View the productions of a CFG as rewriting rules
  • The process of discovering a derivation (a
    sequence of rewrites) is called parsing
  • At each step, we choose a non-terminal to replace
  • This can lead to different derivations, but two
    are of particular interest
  • leftmost derivation: the leftmost non-terminal is
    replaced at each step
  • rightmost derivation: the rightmost non-terminal
    is replaced at each step

22
Ambiguity
  • If a grammar has more than one leftmost (or
    rightmost) derivation for a single sentential
    form, then it is ambiguous
  • For example
  • <stmt> ::= if <expr> then <stmt>
  •          | if <expr> then <stmt> else <stmt>
  •          | other stmts
  • Consider deriving the sentential form
  • if E1 then if E2 then S1 else S2
  • It has two derivations.
  • This ambiguity is purely grammatical and is
    called context-free ambiguity.
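The two derivations can be confirmed by counting parse trees. The sketch below is a small dynamic-programming counter written for this one grammar; the token encoding ("E" for any expression, "S" for any other statement) and the function name are assumptions made for the example.

```python
from functools import lru_cache


def count_stmt_parses(tokens):
    """Count parse trees for tokens under the ambiguous if/then/else grammar.

    Tokens: "if", "then", "else", "E" (any expression), "S" (any other stmt).
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def stmt(i, j):
        """Number of ways tokens[i:j] can be derived from <stmt>."""
        if j - i == 1:
            return 1 if tokens[i] == "S" else 0
        if j - i < 4 or tokens[i:i + 3] != ("if", "E", "then"):
            return 0
        total = stmt(i + 3, j)                   # if E then <stmt>
        for k in range(i + 4, j - 1):            # try each else as the split
            if tokens[k] == "else":
                total += stmt(i + 3, k) * stmt(k + 1, j)
        return total

    return stmt(0, len(tokens))


dangling = ["if", "E", "then", "if", "E", "then", "S", "else", "S"]
print(count_stmt_parses(dangling))
print(count_stmt_parses(["if", "E", "then", "S", "else", "S"]))
```

The nested statement has two parse trees, one for each then the else could belong to, while a single if-then-else has exactly one.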

23
Ambiguity
  • May be able to eliminate ambiguities by
    rearranging the grammar
  • <stmt> ::= <matched> | <unmatched>
  • <matched> ::=
  • if <expr> then <matched> else <matched> | other stmts
  • <unmatched> ::=
  • if <expr> then <stmt>
  • | if <expr> then <matched> else <unmatched>
  • This generates the same language as the ambiguous
    grammar, but applies the common sense rule match
    each else with the closest unmatched then
  • This is most likely the language designer's intent
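In a hand-written recursive-descent parser the same rule is commonly obtained by consuming an else greedily as soon as the inner statement has been parsed. The sketch below takes that shortcut rather than implementing the matched/unmatched grammar directly; the token encoding ("E" for any expression, "S" for any other statement), the tuple-shaped trees and the name `parse_stmt` are assumptions for the example.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos).

    Each else is attached to the closest unmatched then, because the
    innermost call gets the first chance to consume it.
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos] == "S":
        return "S", pos + 1
    if tokens[pos:pos + 3] != ["if", "E", "then"]:
        raise SyntaxError(f"unexpected token {tokens[pos]!r} at {pos}")
    then_branch, pos = parse_stmt(tokens, pos + 3)
    if pos < len(tokens) and tokens[pos] == "else":
        else_branch, pos = parse_stmt(tokens, pos + 1)
        return ("if", then_branch, else_branch), pos
    return ("if", then_branch), pos


tokens = ["if", "E", "then", "if", "E", "then", "S", "else", "S"]
tree, end = parse_stmt(tokens)
print(tree)
```

For if E1 then if E2 then S1 else S2 the result nests the else inside the inner if, so the outer if has no else branch.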