Provided by: henrica
1
Syntactic Analysis
2
The Big Picture Again
[Figure: the COMPILER pipeline from source code to machine code, with stages Scanner, Parser, Opt1, Opt2, . . ., Optn, Instruction Selection, Register Allocation, and Instruction Scheduling]
3
Syntactic Analysis
  • Lexical Analysis was about ensuring that we
    extract a set of valid words (i.e.,
    tokens/lexemes) from the source code
  • But nothing says that the words make a coherent
    sentence (i.e., program)
  • Example:
  • for while i == == == 12 + for ( abcd)
  • Lexer will produce a stream of tokens:
    <TOKEN_FOR> <TOKEN_WHILE> <TOKEN_IDENT, i>
    <TOKEN_COMPARE> <TOKEN_COMPARE> <TOKEN_COMPARE>
    <TOKEN_NUMBER, 12> <TOKEN_OP, +> <TOKEN_FOR>
    <TOKEN_OPAREN> <TOKEN_IDENT, abcd> <TOKEN_CPAREN>
  • But clearly we do not have a valid program
  • This program is lexically correct, but
    syntactically incorrect
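To make the point concrete, here is a minimal lexer sketch in Python. The token names follow the slide, but the regular expressions, the `tokenize` function, and the choice of `+` as the operator in the sample input are assumptions made for illustration. It shows that a lexer happily accepts a meaningless sequence of valid words.

```python
import re

# Token names modeled on the slide; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("TOKEN_FOR", r"for\b"),
    ("TOKEN_WHILE", r"while\b"),
    ("TOKEN_COMPARE", r"=="),
    ("TOKEN_NUMBER", r"\d+"),
    ("TOKEN_IDENT", r"[A-Za-z_]\w*"),
    ("TOKEN_OP", r"[+\-*/]"),
    ("TOKEN_OPAREN", r"\("),
    ("TOKEN_CPAREN", r"\)"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token_name, lexeme) pairs, or raise on a bad character."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"lexical error at position {pos}")
        if match.lastgroup != "SKIP":      # whitespace is dropped
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize("for while i == == == 12 + for ( abcd)")` succeeds, because every word is a valid token. Rejecting this input is the parser's job, not the lexer's.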

4
Grammar
  • Question: How do we determine that a sentence is
    syntactically correct?
  • Answer: We check against a grammar!
  • A grammar consists of rules that determine which
    sentences are correct
  • Example in English:
  • A sentence must have a verb
  • Example in C:
  • A { must have a matching }

5
Grammar
  • Regular expressions are one way to specify a set
    of rules
  • Unfortunately they are not powerful enough for
    the purpose of describing the syntax of
    programming languages
  • Example:
  • A variable must be declared before it is used
  • We can't implement this with regular expressions
    because they do not have memory!
  • no way of counting and remembering counts
  • Therefore we need a more powerful tool
  • This tool is called Context-Free Grammars
  • And some hacks on top of it

6
Context-Free Grammars
  • A context-free grammar (CFG) consists of a set of
    production rules
  • Each rule describes how a non-terminal symbol can
    be replaced or expanded by a string that
    consists of non-terminal symbols and terminal
    symbols
  • Terminal symbols are really tokens
  • Rules are written with syntax like regular
    expressions
  • Rules can then be applied recursively
  • Eventually one reaches a string of only terminal
    symbols, or so one hopes
  • This string is syntactically correct according to
    the grammatical rules!
  • Let's see a simple example

7
CFG Example
  • Set of non-terminals: A, B, C (uppercase
    initial)
  • Start non-terminal: S (uppercase initial)
  • Set of terminal symbols: a, b, c, d
  • Set of production rules:
  • S → A | BC
  • A → Aa | a (Extended Backus-Naur form - EBNF)
  • B → bBCb | b
  • C → dCcd | c
  • We can now start producing syntactically valid
    strings by doing derivations
  • Example derivations:
  • S → BC → bBCbC → bbdCcdbC → bbdccdbC → bbdccdbc
  • S → A → Aa → Aaa → Aaaa → aaaa
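The derivation process can be mechanized. The sketch below encodes the four rules of this example grammar (with `|` separating alternatives) and searches breadth-first over leftmost derivations to decide whether a target string can be produced from S. The `derivable` function and its pruning strategy are illustrative assumptions; the pruning relies on the fact that no rule in this particular grammar shrinks a string.

```python
from collections import deque

# Grammar of the example: uppercase letters are non-terminals.
RULES = {
    "S": ["A", "BC"],
    "A": ["Aa", "a"],
    "B": ["bBCb", "b"],
    "C": ["dCcd", "c"],
}


def derivable(target, start="S"):
    """Breadth-first search over leftmost derivations, pruned by length."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        # Index of the leftmost non-terminal, if any.
        idx = next((i for i, ch in enumerate(form) if ch in RULES), None)
        if idx is None:
            continue                      # all terminals, but not the target
        if not target.startswith(form[:idx]):
            continue                      # terminal prefix already mismatches
        for rhs in RULES[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # No rule shortens a string, so longer forms can be discarded.
            if len(new_form) <= len(target) and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return False
```

With these rules, `derivable("bbdccdbc")` and `derivable("aaaa")` both hold, matching the two example derivations, while a string such as `"ab"` cannot be derived.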

8
A Grammar for Expressions
  • Expr → Expr Op Expr
  • Expr → Number | Identifier
  • Identifier → Letter | Letter Identifier
  • Letter → a-z
  • Op → + | - | * | /
  • Number → Digit Number | Digit
  • Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  • Expr → Expr Op Expr → Number Op Expr →
    Digit Number Op Expr → 3 Number Op
    Expr → 34 Op Expr → 34 * Expr → 34 * Identifier →
    34 * Letter Identifier → 34 * a
    Identifier → 34 * a Letter → 34 * ax

9
What is Parsing?
  • What we just saw is the process of starting with
    the start symbol and, through a sequence of rule
    derivations, obtaining a string of terminal symbols
  • We could generate all correct programs (infinite
    set though)
  • Parsing: the other way around
  • Given a string of terminals, the process of
    discovering a sequence of rule derivations that
    produce this particular string
  • When we say we can't parse a string, we mean that
    we can't find any legal way in which the string
    can be obtained from the start symbol through
    derivations
  • What we want to build is a parser: a program that
    takes in a string of tokens (terminal symbols)
    and discovers a derivation sequence, thus
    validating that the input is a syntactically
    correct program

10
Derivations as Trees
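  • A convenient and natural way to represent a
    sequence of derivations is a syntactic tree or
    parse tree
  • Example: Expr → Expr Op Expr → Number Op Expr →
    Digit Number Op Expr → 3 Number Op Expr → 34 Op
    Expr → 34 * Expr → 34 * Identifier → 34 * Letter
    Identifier → 34 * a Identifier → 34 * a Letter →
    34 * ax

[Figure: parse tree for this derivation. The root Expr has children Expr, Op, and Expr. The left Expr expands to Number, which expands to Digit (3) and Number, then Digit (4). The right Expr expands to Identifier, which expands to Letter (a) and Identifier, then Letter (x).]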
11
Derivations as Trees
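  • In the parser, derivations are implemented as
    trees
  • Often, we draw trees without the full derivations
  • Example:

[Figure: abbreviated parse tree. The root Expr has children Expr, Op, and Expr; the left Expr is drawn directly as Number 34 and the right Expr as Identifier ax.]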
12
Ambiguity
  • We call a grammar ambiguous if a string of
    terminal symbols can be reached by two different
    derivation sequences
  • In other terms, a string can have more than one
    parse tree
  • It turns out that our expression grammar is
    ambiguous!
  • Let's show that string 3*5+8 has two parse trees
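Why two parse trees matter can be shown with a few lines of Python. The sketch assumes the example string is 3*5+8 and represents each tree as nested tuples; the names `LEFT_TREE`, `RIGHT_TREE`, `evaluate`, and `leaves` are illustrative, not part of the lecture.

```python
# A parse tree is either a bare number or a (left, operator, right) tuple.
LEFT_TREE = ((3, "*", 5), "+", 8)     # groups 3*5 first
RIGHT_TREE = (3, "*", (5, "+", 8))    # groups 5+8 first


def leaves(tree):
    """Read the terminal symbols of a tree from left to right."""
    if isinstance(tree, int):
        return str(tree)
    left, op, right = tree
    return leaves(left) + op + leaves(right)


def evaluate(tree):
    """Compute the value that a later compiler stage would derive from the tree."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a * b if op == "*" else a + b
```

Both trees have the same leaves, `"3*5+8"`, yet `evaluate(LEFT_TREE)` gives 23 and `evaluate(RIGHT_TREE)` gives 39. The same string therefore has two meanings, depending on which tree the parser happens to build.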

13
Ambiguity
14
Problems with Ambiguity
  • The problem is that the syntax impacts meaning
    (for the later stages of the compiler)
  • For our example string, we'd like to see the left
    tree because we most likely want * to have a
    higher precedence than +
  • We don't like ambiguity because it makes
    parsers difficult to design: we don't know
    which parse tree will be discovered when there
    are multiple possibilities
  • So we often want to disambiguate grammars
  • It turns out that it is possible to modify
    grammars to make them non-ambiguous
  • by adding non-terminals
  • by adding/rewriting production rules
  • In the case of our expression grammar, we can
    rewrite the grammar to remove ambiguity and to
    ensure that parse trees match our notion of
    operator precedence
  • We get two benefits for the price of one
  • Would work for many operators and many precedence
    relations

15
Non-Ambiguous Grammar
  • Expr → Term | Expr + Term | Expr - Term
  • Term → Term * Factor
  • | Term / Factor
  • | Factor
  • Factor → Number | Identifier
  • Example: 4*5+3-8*9

[Figure: parse tree for the example string]
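The layered Expr / Term / Factor structure translates almost directly into code. The sketch below is one possible hand-written parser for this grammar, under several assumptions: the operators are +, -, *, /; only Number factors are handled (no Identifier); the left-recursive rules are implemented as loops; and the function name `parse_expr` is hypothetical. It returns the tree as nested tuples.

```python
import re


def parse_expr(text):
    """Parse text with the Expr/Term/Factor grammar; return a nested-tuple tree."""
    # Simplistic tokenizer: anything other than digits and operators is ignored.
    tokens = re.findall(r"\d+|[-+*/]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():                       # Factor -> Number
        nonlocal pos
        tok = peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"expected a number, got {tok!r}")
        pos += 1
        return int(tok)

    def term():                         # Term -> Term * Factor | Term / Factor | Factor
        nonlocal pos
        node = factor()
        while peek() in ("*", "/"):
            op = tokens[pos]
            pos += 1
            node = (node, op, factor())
        return node

    def expr():                         # Expr -> Term | Expr + Term | Expr - Term
        nonlocal pos
        node = term()
        while peek() in ("+", "-"):
            op = tokens[pos]
            pos += 1
            node = (node, op, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree
```

For an input such as `"4*5+3-8*9"`, the multiplications end up deepest in the tree and the + and - group from left to right, which is the precedence and associativity the rewritten grammar is meant to enforce.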
17
In-class Exercise
  • Consider the CFG:
  • S → ( L ) | a
  • L → L , S | S
  • Draw parse trees for:
  • (a, a)
  • (a, ((a, a), (a, a)))

18
In-class Exercise
  • Solution for (a, a):

[Figure: parse tree for (a, a). The root S expands to ( L ); L expands to L , S; the inner L expands to S; both S nodes expand to a.]
19
In-class Exercise
  • Solution for (a, ((a, a), (a, a))):

[Figure: parse tree for (a, ((a, a), (a, a)))]
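The exercise can be checked mechanically. Below is a small recursive-descent sketch for the grammar S → ( L ) | a, L → L , S | S. It makes two simplifying assumptions: the left-recursive L rule is handled with a loop, and the result is a nested Python list rather than a full parse tree with every S and L node. The name `parse_s` is hypothetical.

```python
def parse_s(text):
    """Parse a string of the exercise grammar into nested lists of 'a'."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def expect(ch):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != ch:
            raise SyntaxError(f"expected {ch!r} at token {pos}")
        pos += 1

    def s():                            # S -> ( L ) | a
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == "a":
            pos += 1
            return "a"
        expect("(")
        items = l()
        expect(")")
        return items

    def l():                            # L -> L , S | S   (as a loop)
        nonlocal pos
        items = [s()]
        while pos < len(tokens) and tokens[pos] == ",":
            pos += 1
            items.append(s())
        return items

    result = s()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return result
```

The nesting of the returned lists mirrors the nesting of the ( L ) nodes in the hand-drawn trees, which makes it a convenient way to compare against a solution.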
20
In-class Exercise
  • Write a CFG grammar for the language of
    well-formed parenthesized expressions
  • (), (()), ()(), (()()), etc.: OK
  • ()), )(, ((()), (((, etc.: not OK

21
In-class Exercise
  • One solution:
  • P → () | PP | (P)
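As a quick sanity check of the language this grammar describes, the sketch below recognizes well-formed parenthesized strings with a depth counter. This is not a parser for the grammar itself; it only accepts the same strings, and the function name `well_formed` is an assumption. The counter is exactly the kind of memory that regular expressions lack.

```python
def well_formed(text):
    """Return True if text is a non-empty, properly nested string of parentheses."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ')' with no matching '('
                return False
        else:
            return False           # only parentheses belong to the language
    # The grammar has no empty production, so the empty string is rejected.
    return depth == 0 and text != ""
```

All the strings marked OK in the exercise are accepted, and all the ones marked not OK are rejected.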

22
In-class Exercise
  • Is the following grammar ambiguous?
  • A → A and A | not A | 0 | 1

23
In-class Exercise
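  • Answer: yes, a string can have two parse trees

[Figure: two different parse trees over the tokens not, and, 0, 1. One is rooted at A → not A with an A → A and A node below it; the other is rooted at A → A and A with an A → not A node as its left child.]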
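Ambiguity can also be tested by counting. The sketch below counts how many distinct parse trees the grammar A → A and A | not A | 0 | 1 admits for a given token list, using memoized recursion over sub-ranges of the input. The function name `count_parse_trees` and the sample strings are assumptions chosen for illustration.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under A -> A and A | not A | 0 | 1."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        span = tokens[i:j]
        if len(span) == 1:                     # A -> 0 | 1
            return 1 if span[0] in ("0", "1") else 0
        total = 0
        if span and span[0] == "not":          # A -> not A
            total += count(i + 1, j)
        for k in range(i + 1, j - 1):          # A -> A and A, split at each 'and'
            if tokens[k] == "and":
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))
```

A count greater than one for any string proves the grammar ambiguous. Here `count_parse_trees("not 0 and 1".split())` is 2: the `not` can apply either to `0` alone or to the whole `0 and 1`.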
24
Another Example Grammar
  • ForStatement → for ( StmtCommaList ;
    ExprCommaList ; StmtCommaList )
    StmtSemicList
  • StmtCommaList → ε | Stmt | Stmt ,
    StmtCommaList
  • ExprCommaList → ε | Expr | Expr ,
    ExprCommaList
  • StmtSemicList → ε | Stmt | Stmt ;
    StmtSemicList
  • Expr → . . .
  • Stmt → . . .

25
Full Language Grammar Sketch
  • Program → VarDeclList FuncDeclList
  • VarDeclList → ε | VarDecl | VarDecl VarDeclList
  • VarDecl → Type IdentCommaList
  • IdentCommaList → Ident | Ident , IdentCommaList
  • Type → int | char | float
  • FuncDeclList → ε | FuncDecl | FuncDecl
    FuncDeclList
  • FuncDecl → Type Ident ( ArgList )
    VarDeclList StmtList
  • StmtList → ε | Stmt | Stmt StmtList
  • Stmt → Ident = Expr | ForStatement | ...
  • Expr → ...
  • Ident → ...

26
Real-world CFGs
  • Some sample grammars found on the Web:
  • LISP: 7 rules
  • PROLOG: 19 rules
  • Java: 30 rules
  • C: 60 rules
  • Ada: 280 rules
  • LISP is particularly easy because:
  • No operators, just function calls
  • Therefore no precedence, associativity
  • LISP is thus very easy to parse
  • In the Java specification the description of
    operator precedence and associativity takes 25
    pages

27
So What Now?
  • We want to write a compiler for a given language
  • We come up with a definition of the tokens
    embodied in regular expressions
  • We build a lexer as a DFA (see previous lecture)
  • We come up with a definition of the syntax
    embodied in a context-free grammar
  • not ambiguous
  • enforces relevant operator precedences and
    associativity
  • Question: How do we build a parser?
  • i.e., a program that given an input source file
    produces a parse tree

28
How do we build a Parser?
  • This question could keep us busy for 1/2 semester
    in a full-fledged compiler course
  • So we're just going to see a very high-level view
    of parsing
  • If you go to graduate school you'll most likely
    have an in-depth compiler course with all the
    details
  • There are two approaches for parsing:
  • Top-Down: Start with the start symbol and try to
    expand it using derivation rules until you get
    the input source code
  • Bottom-Up: Start with the input source code,
    consume symbols, and infer which rules could be
    used
  • Note: this does not work for all CFGs
  • CFGs must have some properties to be parsable
    with our beloved parsing algorithms

29
Top-Down Parsing
  • A simple recursive algorithm:
  • Start with the start symbol
  • Pick one of the rules to expand it and expand it
  • If the leftmost symbol is a terminal and
    matches the current token of the input source,
    great
  • If there is no match, then backtrack and try
    another rule
  • Repeat for all non-terminal symbols
  • Success if we get all terminals
  • Failure if we've tried all productions without
    getting all terminals
  • Let's see this on an example

30
Top-Down Parsing Example
  • A simple grammar:
  • (R1) Expr → Number + Expr
  • (R2) Expr → Number
  • (R3) Expr → Number * Expr
  • (R4) Number → 0-9
  • Let's parse string 3*4+5
  • (R1) Number + Expr
  • (R4) 3 + Expr   backtrack
  • (R2) Number
  • (R4) 3   backtrack
  • (R3) Number * Expr
  • (R4) 3 * Expr
  • (R1) 3 * Number + Expr
  • (R4) 3 * 4 + Expr
  • (R1) 3 * 4 + Number + Expr
  • (R4) 3 * 4 + 5 + Expr   backtrack
  • (R2) 3 * 4 + Number
  • (R4) 3 * 4 + 5   done!
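The backtracking search above can be sketched with Python generators. The code assumes the grammar is (R1) Expr → Number + Expr, (R2) Expr → Number, (R3) Expr → Number * Expr, (R4) Number → a single digit, and that rules are tried in the order R1, R2, R3. The names `expr`, `number`, and `accepts` are illustrative. Each generator yields every input position where the non-terminal could end, so asking for the next value is what "backtrack and try another rule" amounts to.

```python
def number(tokens, pos):
    """(R4) Number -> 0-9: yield the position after one digit, if present."""
    if pos < len(tokens) and tokens[pos].isdigit():
        yield pos + 1


def expr(tokens, pos, trace):
    """Yield every position at which an Expr starting at pos can end."""
    for rule, op in (("R1", "+"), ("R2", None), ("R3", "*")):
        trace.append(rule)                     # record each rule that is tried
        for after_number in number(tokens, pos):
            if op is None:                     # (R2) Expr -> Number
                yield after_number
            elif after_number < len(tokens) and tokens[after_number] == op:
                # (R1) / (R3): Number op Expr
                yield from expr(tokens, after_number + 1, trace)


def accepts(text):
    """Return (accepted, list of Expr rules tried in order)."""
    tokens = list(text.replace(" ", ""))
    trace = []
    ok = any(end == len(tokens) for end in expr(tokens, 0, trace))
    return ok, trace
```

For the input `"3*4+5"` the recorded Expr rules are R1, R2, R3, R1, R1, R2: two failed attempts, then R3, then the nested search on `4+5`. Input that cannot be derived, such as `"3*"`, exhausts all alternatives and is rejected.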

31
Left-Recursion
  • One problem for the Top-Down approach is
    left-recursive rules
  • Example: Expr → Expr + Number
  • The Parser will expand the leftmost Expr as Expr
    + Number to get Expr + Number + Number
  • And again: Expr + Number + Number + Number
  • And again: Expr + Number + Number + Number +
    Number
  • Ad infinitum. . .
  • Since the leftmost symbol is never a terminal
    symbol the parser will never check for a match
    with the source code and will be stuck in an
    infinite loop
  • Luckily, there are ways to remove left-recursion
32
Bottom-Up Parsing
  • Bottom-up parsing is more general than top-down
    and is the method typically used in practice
  • The idea is very simple
  • Look at the string of tokens, from left to right
  • Look for things that look like the right-hand
    side of production rules
  • Replace the matched tokens with the left-hand
    side of the rule
  • Let's see an example

33
Bottom-Up Parsing Example
  • A simple grammar:
  • (R1) Expr → Number + Expr
  • (R2) Expr → Number
  • (R3) Expr → Number * Expr
  • (R4) Number → 0-9
  • Let's parse string 3*4+5
  • Number * 4 + 5 (R4)
  • Number * Number + 5 (R4)
  • Number * Number + Number (R4)
  • Number * Number + Expr (R2)
  • Number * Expr (R1)
  • Expr (R3) done

34
Bottom-Up Parsing
  • The previous example made it look very simple,
    but this doesn't always work
  • Turns out there is a way to do this
    (shift-reduce parsing) that works for a large
    class of non-ambiguous grammars
  • Uses a stack to do some backtracking
  • More about all this in a compiler course
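To give a flavor of the stack-based idea, here is a deliberately naive shift-reduce sketch. It is hard-wired to one small grammar, assumed to be (R1) Expr → Number + Expr, (R2) Expr → Number, (R3) Expr → Number * Expr, (R4) Number → a single digit, and it decides when to reduce with a single hand-written rule of thumb: reduce a digit immediately, and reduce to Expr only once the input is exhausted. Real shift-reduce parsers drive these decisions from generated tables; the function name and structure here are assumptions.

```python
def shift_reduce(text):
    """Return (accepted, list of rules applied) for the R1-R4 grammar."""
    rest = list(text.replace(" ", ""))     # unread input tokens
    stack, applied = [], []
    while True:
        if stack and stack[-1].isdigit():                       # reduce by R4
            stack[-1] = "Number"
            applied.append("R4")
        elif not rest and stack and stack[-1] == "Number":      # reduce by R2
            stack[-1] = "Expr"
            applied.append("R2")
        elif not rest and stack[-3:] == ["Number", "+", "Expr"]:  # reduce by R1
            stack[-3:] = ["Expr"]
            applied.append("R1")
        elif not rest and stack[-3:] == ["Number", "*", "Expr"]:  # reduce by R3
            stack[-3:] = ["Expr"]
            applied.append("R3")
        elif rest:                                              # shift
            stack.append(rest.pop(0))
        else:
            break                          # nothing left to shift or reduce
    return stack == ["Expr"], applied
```

On `"3*4+5"` the rules are applied in the order R4, R4, R4, R2, R1, R3, which is a derivation of the string read backwards. Input is accepted only if the stack ends up holding the single start symbol.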

35
Writing Parsers?
  • Nowadays one doesn't really write parsers from
    scratch, but one uses a parser generator (Yacc is
    a famous one)

[Figure: at compiler design time, a grammar specification is given to the Parser Generator, which produces the Parser; at compile time, the Parser turns a token stream into a parse tree]
36
Sample (simplified) YACC Input
  • %token DIGIT /* Definition of token names */
  • line : expr '\n'
  • expr : expr '+' term
  • | term
  • term : term '*' factor
  • | factor
  • factor : '(' expr ')'
  • | DIGIT

37
So What Now?
  • The parser accepts syntactically correct programs
    and produces a full parse tree
  • Unfortunately, being syntactically correct is a
    necessary condition for the program to be correct
    (i.e., compilable), but is not sufficient
  • Let's see this on a simple example
  • Say we want to write a compiler for the Ada
    language
  • The Ada language requires that procedures be
    written as:
  • procedure my_func
  • . . .
  • end my_func
  • An incorrect program:
  • procedure my_func
  • . . .
  • end some_other_name
  • Problem: There is no way to express the "both
    names should be the same" requirement in a CFG!
  • Both are seen as a TOKEN_IDENT token

38
Attributed Syntax Tree
  • To perform such checks we need to associate
    attributes to nodes in the Syntax Tree and to
    define rules about these attributes
  • You can really see this as adding tons of little
    pieces of code associated with grammar rules
  • There would be a lot of material to cover here,
    but let's just see two simple examples
  • Example 1: The Ada Example
  • ProcDecl → procedure Ident ProcBody end Ident
  • Whenever this rule is used, run the piece of
    code:
  • if (Ident1.lexeme != Ident2.lexeme) {
  • fprintf(stderr, "Syntax error: non-matched procedure names\n");
  • exit(1);
  • }
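The same check can be written as a small runnable sketch. It assumes the parser hands the action a list of (kind, lexeme) pairs that already matched the rule ProcDecl → procedure Ident ProcBody end Ident; the `SemanticError` class, the `check_proc_decl` function, and the token kinds in the usage below are hypothetical.

```python
class SemanticError(Exception):
    """Raised when a program parses but breaks a rule the grammar cannot express."""


def check_proc_decl(tokens):
    """Action attached to: ProcDecl -> procedure Ident ProcBody end Ident.

    tokens is a list of (kind, lexeme) pairs that already matched the rule,
    so the first Ident is at index 1 and the second Ident is the last token.
    """
    ident1 = tokens[1][1]      # lexeme following 'procedure'
    ident2 = tokens[-1][1]     # lexeme following 'end'
    if ident1 != ident2:
        raise SemanticError(
            f"non-matched procedure names: {ident1} / {ident2}"
        )
    return ident1
```

Both identifiers have the same token kind, so the grammar alone cannot tell them apart; only the comparison of their lexemes in the attached code can.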

39
Attributed Syntax Tree
  • Example 2: Type Checking
  • Say we have a language in which the body of a
    function can declare variables
  • VarDecl → Type Ident
  • Each time we see this we execute the following
    code:
  • Symbol_Table.insert(Ident.lexeme, Type.lexeme);
  • (updates some table that keeps track of
    variables and their types)
  • Sum → Ident = Number + Number
  • Each time we see this we execute the following
    code:
  • if ((Number1.type == float) ||
    (Number2.type == float))
  • if (Symbol_Table.lookupType(Ident.lexeme)
    != float) {
  • fprintf(stderr, "Syntax error: float must be assigned to float\n");
  • exit(1);
  • }
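A runnable version of these two actions might look as follows. The sketch assumes the rules are VarDecl → Type Ident and Sum → Ident = Number + Number, and that types are carried as plain strings. The class `SymbolTable` and the functions `on_var_decl` and `on_sum` are hypothetical stand-ins for the code a real compiler would attach to its grammar rules.

```python
class SymbolTable:
    """Keeps track of declared variables and their types."""

    def __init__(self):
        self._types = {}

    def insert(self, name, type_name):
        self._types[name] = type_name

    def lookup_type(self, name):
        return self._types.get(name)       # None if the variable is undeclared


def on_var_decl(table, type_lexeme, ident_lexeme):
    """Action for: VarDecl -> Type Ident."""
    table.insert(ident_lexeme, type_lexeme)


def on_sum(table, ident_lexeme, number1_type, number2_type):
    """Action for: Sum -> Ident = Number + Number."""
    # If either operand is a float, the target variable must be a float too.
    if "float" in (number1_type, number2_type):
        if table.lookup_type(ident_lexeme) != "float":
            raise TypeError("float must be assigned to float")
```

With `x` declared as `int`, a sum of two `int` operands passes, while a sum involving a `float` operand raises the error. Note that this sketch, like the slide, does not report an undeclared variable when both operands are `int`.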

40
Conclusion
  • The goal here was to give you some idea of how a
    parser can be built, and where these parsing
    error messages you see come from
  • Of course it's much more complicated than what
    I've let on
  • There are several great books on compilers that
    have very detailed sections on syntactic analysis
  • A classic: Compilers: Principles, Techniques,
    and Tools