Syntactic Analysis - PowerPoint PPT Presentation

Provided by: henrica

1
Syntactic Analysis
2
The Big Picture Again
[Diagram: the COMPILER box takes source code through the Scanner, the Parser, the optimization passes Opt1, Opt2, . . ., Optn, then Instruction Selection, Register Allocation, and Instruction Scheduling, and outputs machine code]
3
Syntactic Analysis

Lexical Analysis was about ensuring that we
extract a set of valid words (i.e.,
tokens/lexemes) from the source code
But nothing says that the words make a coherent
sentence (i.e., program)
Example:
for while i == == == 12 + for ( abcd)
Lexer will produce a stream of tokens:
<TOKEN_FOR> <TOKEN_WHILE> <TOKEN_IDENT, i>
<TOKEN_COMPARE> <TOKEN_COMPARE> <TOKEN_COMPARE>
<TOKEN_NUMBER, 12> <TOKEN_OP, +> <TOKEN_FOR>
<TOKEN_OPAREN> <TOKEN_IDENT, abcd> <TOKEN_CPAREN>
But clearly we do not have a valid program
This program is lexically correct, but
syntactically incorrect
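
A minimal sketch of such a lexer, assuming Python's re module and token names modeled on the ones above (the patterns and the exact input string are assumptions, not taken from the lecture). It shows that a nonsensical program can still be tokenized without any lexical error:

```python
import re

# Hypothetical token definitions, one regular expression per token kind.
TOKEN_SPEC = [
    ("TOKEN_FOR",     r"for\b"),
    ("TOKEN_WHILE",   r"while\b"),
    ("TOKEN_NUMBER",  r"\d+"),
    ("TOKEN_IDENT",   r"[A-Za-z_]\w*"),
    ("TOKEN_COMPARE", r"=="),
    ("TOKEN_OP",      r"[+\-*/]"),
    ("TOKEN_OPAREN",  r"\("),
    ("TOKEN_CPAREN",  r"\)"),
    ("SKIP",          r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token_name, lexeme) pairs, or raise on a lexical error."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"lexical error at position {pos}")
        if match.lastgroup != "SKIP":          # whitespace produces no token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

# Lexically fine, syntactically meaningless.
for name, lexeme in tokenize("for while i == == == 12 + for ( abcd)"):
    print(name, lexeme)
```

The lexer only checks words one at a time, which is why the parser is needed to reject this input.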

4
Grammar

Question: How do we determine that a sentence is
syntactically correct?
Answer: We check against a grammar!
A grammar consists of rules that determine which
sentences are correct
Example in English:
A sentence must have a verb
Example in C:
A { must have a matching }

5
Grammar

Regular expressions are one way to specify a set
of rules
Unfortunately they are not powerful enough for
the purpose of describing the syntax of
programming languages
Example:
A variable must be declared before it is used
We can't implement this with regular expressions
because they do not have memory!
no way of counting and remembering counts
Therefore we need a more powerful tool
This tool is called Context-Free Grammars
And some hacks on top of it

6
Context-Free Grammars

A context-free grammar (CFG) consists of a set of
production rules
Each rule describes how a non-terminal symbol can
be replaced or expanded by a string that
consists of non-terminal symbols and terminal
symbols
Terminal symbols are really tokens
Rules are written with syntax like regular
expressions
Rules can then be applied recursively
Eventually one reaches a string of only terminal
symbols, or so one hopes
This string is syntactically correct according to
the grammatical rules!
Let's see a simple example

7
CFG Example

Set of non-terminals: A, B, C (uppercase
initial)
Start non-terminal: S (uppercase initial)
Set of terminal symbols: a, b, c, d
Set of production rules:
S → A | BC
A → Aa | a (Extended Backus-Naur form - EBNF)
B → bBCb | b
C → dCcd | c
We can now start producing syntactically valid
strings by doing derivations
Example derivations:
S → BC → bBCbC → bbdCcdbC → bbdccdbC → bbdccdbc
S → A → Aa → Aaa → Aaaa → aaaa
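
As a rough illustration, the derivation process can be mechanized. The sketch below assumes the grammar reads S → A | BC, A → Aa | a, B → bBCb | b, C → dCcd | c, with one character per symbol. It searches breadth-first over leftmost derivations; the function name derives is made up for this example:

```python
from collections import deque

# Assumed reading of the example grammar: one character per symbol.
GRAMMAR = {
    "S": ["A", "BC"],
    "A": ["Aa", "a"],
    "B": ["bBCb", "b"],
    "C": ["dCcd", "c"],
}

def derives(target, start="S"):
    """Return True if `target` can be derived from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        # Locate the leftmost non-terminal of the sentential form.
        idx = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
        if idx is None:
            continue                      # only terminals, but not the target
        for rhs in GRAMMAR[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            # No rule shrinks a form, so anything longer than the target is hopeless.
            if len(new_form) <= len(target) and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return False

print(derives("bbdccdbc"))   # string from the first example derivation
print(derives("aaaa"))       # string from the second example derivation
print(derives("ab"))         # not in the language
```

The length-based pruning only works because every right-hand side here is non-empty; a grammar with empty rules would need a different stopping condition.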

8
A Grammar for Expressions

Expr → Expr Op Expr
Expr → Number | Identifier
Identifier → Letter | Letter Identifier
Letter → a-z
Op → + | - | * | /
Number → Digit Number | Digit
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Example derivation:
Expr → Expr Op Expr → Number Op Expr →
Digit Number Op Expr → 3 Number Op Expr →
34 Op Expr → 34 * Expr → 34 * Identifier →
34 * Letter Identifier → 34 * a Identifier →
34 * a Letter → 34 * ax

9
What is Parsing?

What we just saw is the process of starting with
the start symbol and, through a sequence of rule
derivations, obtaining a string of terminal symbols
We could generate all correct programs (infinite
set though)
Parsing: the other way around
Given a string of terminals, the process of
discovering a sequence of rule derivations that
produces this particular string
When we say we can't parse a string, we mean that
we can't find any legal way in which the string
can be obtained from the start symbol through
derivations
What we want to build is a parser: a program that
takes in a string of tokens (terminal symbols)
and discovers a derivation sequence, thus
validating that the input is a syntactically
correct program

10
Derivations as Trees

A convenient and natural way to represent a
sequence of derivations is a syntactic tree or
parse tree
Example: Expr → Expr Op Expr → Number Op Expr →
Digit Number Op Expr → 3 Number Op Expr → 34 Op
Expr → 34 * Expr → 34 * Identifier → 34 * Letter
Identifier → 34 * a Identifier → 34 * a Letter →
34 * ax

[Parse tree: the root Expr has children Expr, Op, and Expr. The left Expr expands to Number → Digit Number, with leaves 3 and 4. The right Expr expands to Identifier → Letter Identifier, with leaves a and x.]
11
Derivations as Trees

In the parser derivations are implemented as
trees
Often, we draw trees without the full derivations
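Example:

[Simplified parse tree: the root Expr has children Expr, Op, and Expr; the left Expr is a Number with leaf 34, and the right Expr is an Identifier with leaf ax]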
12
Ambiguity

We call a grammar ambiguous if a string of
terminal symbols can be reached by two different
derivation sequences
In other terms, a string can have more than one
parse tree
It turns out that our expression grammar is
ambiguous!
Let's show that string 3*5+8 has two parse trees
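
One way to check the claim is to enumerate every parse tree. The sketch below assumes the rule Expr → Expr Op Expr | Number, treats each number as a single token, and takes 3*5+8 as the example string; the helper name parse_trees is hypothetical:

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}

def parse_trees(tokens):
    """Return every parse tree of `tokens` under Expr -> Expr Op Expr | Number.

    A tree is either a number token or a (left, operator, right) tuple.
    """
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def trees(lo, hi):
        if hi - lo == 1:                               # Expr -> Number
            return (tokens[lo],) if tokens[lo].isdigit() else ()
        found = []
        for k in range(lo + 1, hi - 1):                # Expr -> Expr Op Expr
            if tokens[k] in OPERATORS:
                for left in trees(lo, k):
                    for right in trees(k + 1, hi):
                        found.append((left, tokens[k], right))
        return tuple(found)

    return list(trees(0, len(tokens)))

for tree in parse_trees(["3", "*", "5", "+", "8"]):
    print(tree)
```

Each operator position can serve as the top-level split, so this input yields one tree grouping 5+8 first and one grouping 3*5 first.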

13
Ambiguity
14
Problems with Ambiguity

The problem is that the syntax impacts meaning
(for the later stages of the compiler)
For our example string, we'd like to see the left
tree because we most likely want * to have a
higher precedence than +
We don't like ambiguity because it makes
parsers difficult to design: we don't know
which parse tree will be discovered when there
are multiple possibilities
So we often want to disambiguate grammars
It turns out that it is often possible to modify
grammars to make them non-ambiguous
by adding non-terminals
by adding/rewriting production rules
In the case of our expression grammar, we can
rewrite the grammar to remove ambiguity and to
ensure that parse trees match our notion of
operator precedence
We get two benefits for the price of one
Would work for many operators and many precedence
relations

15
Non-Ambiguous Grammar

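Expr → Term | Expr + Term | Expr - Term
Term → Term * Factor
| Term / Factor
| Factor
Factor → Number | Identifier
Example: 4*5+3-8*9

[Parse tree: the root is Expr → Expr - Term. The left Expr is Expr + Term, where the inner Expr derives 4*5 through Term → Term * Factor and the Term derives 3. The right Term derives 8*9 through Term → Term * Factor. Every Factor is a Number.]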
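
To see how the layered grammar encodes precedence, here is a small hand-written evaluator that follows the Expr / Term / Factor structure. It is a sketch under assumptions: Factor is restricted to single-token numbers (no identifiers), the left-recursive rules are written as loops, and the input 4*5+3-8*9 is an assumed reading of the example:

```python
def evaluate(tokens):
    """Evaluate a token list according to the Expr / Term / Factor grammar."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():                       # Factor -> Number
        nonlocal pos
        tok = peek()
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"number expected at token {pos}")
        pos += 1
        return int(tok)

    def term():                         # Term -> Term * Factor | Term / Factor | Factor
        nonlocal pos
        value = factor()
        while peek() in ("*", "/"):
            op = tokens[pos]
            pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():                         # Expr -> Expr + Term | Expr - Term | Term
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):
            op = tokens[pos]
            pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at position {pos}")
    return result

print(evaluate(list("4*5+3-8*9")))      # (4*5 + 3) - (8*9)
```

Because Term sits below Expr, every multiplication is grouped before any addition or subtraction sees it, which is the precedence the rewritten grammar was designed to enforce.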
17
In-class Exercise

Consider the CFG
S → ( L ) | a
L → L , S | S
Draw parse trees for
(a, a)
(a, ((a, a), (a, a)))

18
In-class Exercise


[Parse tree for (a, a): the root is S → ( L ), with L → L , S. The inner L is L → S → a, and the last S is S → a.]
19
In-class Exercise
[Parse tree for (a, ((a, a), (a, a))): the root is S → ( L ), with L → L , S. The inner L is L → S → a. The last S is S → ( L ) with L → L , S, and both of its elements are S → ( L ) subtrees with the same shape as the tree for (a, a).]
20
In-class Exercise

Write a CFG grammar for the language of
well-formed parenthesized expressions
(), (()), ()(), (()()), etc. OK
()), )(, ((()), (((, etc. not OK

21
In-class Exercise

P → () | PP | (P)
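
A quick way to sanity-check such a grammar is to turn its rules into a recognizer. The sketch below assumes the grammar reads P → () | PP | (P) and tests each rule on a substring, with memoization; the name well_formed is invented for this example:

```python
from functools import lru_cache

def well_formed(s):
    """Return True if `s` is derivable from P -> () | PP | (P)."""

    @lru_cache(maxsize=None)
    def is_p(lo, hi):
        length = hi - lo
        if length < 2 or length % 2 == 1:
            return False
        if s[lo:hi] == "()":                                   # P -> ()
            return True
        if s[lo] == "(" and s[hi - 1] == ")" and is_p(lo + 1, hi - 1):
            return True                                        # P -> (P)
        # P -> PP: try every split into two non-empty, even-length halves.
        return any(is_p(lo, k) and is_p(k, hi)
                   for k in range(lo + 2, hi - 1, 2))

    return is_p(0, len(s))

for text in ["()", "(())", "()()", "(()())", "())", ")(", "((())", "((("]:
    print(text, well_formed(text))
```

The first four strings are the ones listed as OK in the exercise and the last four are the ones listed as not OK.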

22
In-class Exercise

Is the following grammar ambiguous?
A → A and A | not A | 0 | 1

23
In-class Exercise


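[Two parse trees for the same string: in one, the root is A → not A and its child is A → A and A; in the other, the root is A → A and A and its left child is A → not A. The grammar is therefore ambiguous.]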
24
Another Example Grammar

ForStatement → for ( StmtCommaList ;
ExprCommaList ; StmtCommaList )
{ StmtSemicList }
StmtCommaList → ε | Stmt | Stmt ,
StmtCommaList
ExprCommaList → ε | Expr | Expr ,
ExprCommaList
StmtSemicList → ε | Stmt | Stmt ;
StmtSemicList
Expr → . . .
Stmt → . . .

25
Full Language Grammar Sketch

Program → VarDeclList FuncDeclList
VarDeclList → ε | VarDecl | VarDecl VarDeclList
VarDecl → Type IdentCommaList ;
IdentCommaList → Ident | Ident , IdentCommaList
Type → int | char | float
FuncDeclList → ε | FuncDecl | FuncDecl
FuncDeclList
FuncDecl → Type Ident ( ArgList )
{ VarDeclList StmtList }
StmtList → ε | Stmt | Stmt StmtList
Stmt → Ident = Expr ; | ForStatement | ...
Expr → ...
Ident → ...

26
Real-world CFGs

Some sample grammars found on the Web:
LISP: 7 rules
PROLOG: 19 rules
Java: 30 rules
C: 60 rules
Ada: 280 rules
LISP is particularly easy because
No operators, just function calls
Therefore no precedence, associativity
LISP is thus very easy to parse
In the Java specification the description of
operator precedence and associativity takes 25
pages

27
So What Now?

We want to write a compiler for a given language
We come up with a definition of the tokens
embodied in regular expressions
We build a lexer as a DFA (see previous lecture)
We come up with a definition of the syntax
embodied in a context-free grammar
not ambiguous
enforces relevant operator precedences and
associativity
Question: How do we build a parser?
i.e., a program that given an input source file
produces a parse tree

28
How do we build a Parser?

This question could keep us busy for 1/2 semester
in a full-fledged compiler course
So we're just going to see a very high-level view
of parsing
If you go to graduate school you'll most likely
have an in-depth compiler course with all the
details
There are two approaches for parsing
Top-Down: Start with the start symbol and try to
expand it using derivation rules until you get
the input source code
Bottom-Up: Start with the input source code,
consume symbols, and infer which rules could be
used
Note: this does not work for all CFGs
CFGs must have some properties to be parsable
with our beloved parsing algorithms

29
Top-Down Parsing

A simple recursive algorithm
Start with the start symbol
Pick one of the rules to expand it and expand it
If the leftmost symbol is a terminal and
matches the current token of the input source,
great
If there is no match, then backtrack and try
another rule
Repeat for all non-terminal symbols
Success if we get all terminals
Failure if we've tried all productions without
getting all terminals
Let's see this on an example

30
Top-Down Parsing Example

A simple grammar
(R1) Expr → Number + Expr
(R2) Expr → Number
(R3) Expr → Number * Expr
(R4) Number → 0-9
Let's parse string 3*4+5
(R1) Number + Expr
(R4) 3 + Expr backtrack
(R2) Number
(R4) 3 backtrack
(R3) Number * Expr
(R4) 3 * Expr
(R1) 3 * Number + Expr
(R4) 3 * 4 + Expr
(R1) 3 * 4 + Number + Expr
(R4) 3 * 4 + 5 + Expr backtrack
(R2) 3 * 4 + Number
(R4) 3 * 4 + 5 done!
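
The trial-and-backtrack procedure can be sketched with Python generators. This assumes the rules read (R1) Expr → Number + Expr, (R2) Expr → Number, (R3) Expr → Number * Expr, (R4) Number → 0-9, and it only answers yes or no rather than building a tree; the function names are made up for illustration:

```python
# Assumed reading of rules R1-R4; anything that is not a key is a terminal.
RULES = {
    "Expr": [["Number", "+", "Expr"],      # R1
             ["Number"],                   # R2
             ["Number", "*", "Expr"]],     # R3
    "Number": [[d] for d in "0123456789"], # R4
}

def parse(symbol, tokens, pos):
    """Yield every position at which `symbol` can end when started at `pos`."""
    if symbol not in RULES:                       # terminal: must match the token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for rhs in RULES[symbol]:                     # try rules in order;
        yield from match_sequence(rhs, tokens, pos)   # a dead end means backtrack

def match_sequence(symbols, tokens, pos):
    """Yield every end position for matching `symbols` one after another."""
    if not symbols:
        yield pos
        return
    for mid in parse(symbols[0], tokens, pos):
        yield from match_sequence(symbols[1:], tokens, mid)

def accepts(tokens):
    """True if the whole token list can be derived from Expr."""
    return any(end == len(tokens) for end in parse("Expr", tokens, 0))

print(accepts(list("3*4+5")))
print(accepts(list("3*+5")))
```

Exhausting a generator without reaching the end of the input plays the role of "backtrack" in the trace above. This only terminates because none of the rules is left-recursive.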

31
Left-Recursion

One problem for the Top-Down approach is
left-recursive rules
Example: Expr → Expr + Number
The Parser will expand the leftmost Expr as Expr
+ Number to get Expr + Number + Number
And again: Expr + Number + Number + Number
And again: Expr + Number + Number + Number +
Number
Ad infinitum. . .
Since the leftmost symbol is never a terminal
symbol the parser will never check for a match
with the source code and will be stuck in an
infinite loop
Luckily, there are ways to remove left-recursion
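
The problem and one common fix can be shown side by side. The sketch assumes the rule is Expr → Expr + Number | Number with single-token numbers. The first function transcribes the rule naively; the second uses the equivalent form Expr → Number (+ Number)*, which is one standard way to remove the left recursion (both function names are hypothetical):

```python
def expr_left_recursive(tokens, pos=0):
    """Naive transcription of Expr -> Expr + Number | Number."""
    # First alternative: the leftmost symbol is Expr again, so we recurse
    # at the same position without having consumed any token.
    end = expr_left_recursive(tokens, pos)
    if (end is not None and end + 1 < len(tokens)
            and tokens[end] == "+" and tokens[end + 1].isdigit()):
        return end + 2
    # Second alternative: Expr -> Number (never reached in practice).
    if pos < len(tokens) and tokens[pos].isdigit():
        return pos + 1
    return None

def expr_iterative(tokens, pos=0):
    """Same language, rewritten as Expr -> Number ('+' Number)*.

    Returns the position just past the expression, or None if there is none.
    """
    if pos >= len(tokens) or not tokens[pos].isdigit():
        return None
    pos += 1
    while (pos + 1 < len(tokens)
           and tokens[pos] == "+" and tokens[pos + 1].isdigit()):
        pos += 2
    return pos

tokens = ["1", "+", "2", "+", "3"]
try:
    expr_left_recursive(tokens)
except RecursionError:
    print("left-recursive version recursed without consuming a token")
print(expr_iterative(tokens))
```

In Python the runaway recursion ends with a RecursionError instead of a true infinite loop, but the cause is the same: no terminal is ever compared against the input.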

32
Bottom-Up Parsing

Bottom-up parsing is more general than top-down
and is the method typically used in practice
The idea is very simple
Look at the string of tokens, from left to right
Look for things that look like the right-hand
side of production rules
Replace them with the left-hand side of the rule
Let's see an example

33
Bottom-Up Parsing Example

A simple grammar
(R1) Expr → Number + Expr
(R2) Expr → Number
(R3) Expr → Number * Expr
(R4) Number → 0-9
Let's parse string 3*4+5
Number * 4 + 5 (R4)
Number * Number + 5 (R4)
Number * Number + Number (R4)
Number * Number + Expr (R2)
Number * Expr (R1)
Expr (R3) done

34
Bottom-Up Parsing

The previous example made it look very simple,
but this doesn't always work
Turns out there is a way to do this
(shift-reduce parsing) that is guaranteed to
work for a large class of non-ambiguous grammars
(the LR grammars)
Uses a stack to keep track of what has been seen
More about all this in a compiler course
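
A toy version of the shift-reduce idea, for illustration only. It assumes the rules (R1) Expr → Number + Expr, (R2) Expr → Number, (R3) Expr → Number * Expr, (R4) Number → 0-9, and the decisions about when to reduce are hand-coded for this one grammar. A real parser generator would derive them as a table:

```python
def shift_reduce(tokens):
    """Return (accepted, rules_used) for the small Expr grammar."""
    stack = []
    rest = list(tokens)
    steps = []

    def reduce_once():
        # R4: a digit on top of the stack becomes a Number.
        if stack and stack[-1].isdigit():
            stack[-1] = "Number"
            return "R4"
        # R1 / R3: Number op Expr on top of the stack becomes an Expr.
        if stack[-3:] == ["Number", "+", "Expr"]:
            stack[-3:] = ["Expr"]
            return "R1"
        if stack[-3:] == ["Number", "*", "Expr"]:
            stack[-3:] = ["Expr"]
            return "R3"
        # R2: a Number becomes an Expr only once the input is exhausted,
        # because in this grammar Expr always ends a right-hand side.
        if stack and stack[-1] == "Number" and not rest:
            stack[-1] = "Expr"
            return "R2"
        return None

    while True:
        rule = reduce_once()
        if rule is not None:
            steps.append(rule)            # reduce
        elif rest:
            stack.append(rest.pop(0))     # shift
        else:
            break
    return stack == ["Expr"], steps

print(shift_reduce(list("3*4+5")))
```

The stack holds the symbols seen so far, and the check on the remaining input before applying R2 is a simple form of lookahead.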

35
Writing Parsers?

Nowadays one doesn't really write parsers from
scratch, but one uses a parser generator (Yacc is
a famous one)

[Diagram: at compiler design time, a grammar specification is fed to the Parser Generator, which produces the Parser; at compile time, the Parser turns a token stream into a parse tree]
36
Sample (simplified) YACC Input

%token DIGIT /* Definition of token names */
%%
line : expr '\n' ;
expr : expr '+' term
     | term ;
term : term '*' factor
     | factor ;
factor : '(' expr ')'
       | DIGIT ;

37
So What Now?

The parser accepts syntactically correct programs
and produces a full parse tree
Unfortunately, being syntactically correct is a
necessary condition for the program to be correct
(i.e., compilable), but is not sufficient
Let's see this on a simple example
Say we want to write a compiler for the Ada
language
The Ada language requires that procedures be
written as
procedure my_func
. . .
end my_func
An incorrect program:
procedure my_func
. . .
end some_other_name
Problem: There is no way to express the "both
names should be the same" requirement in a CFG!
Both are seen as a TOKEN_IDENT token

38
Attributed Syntax Tree

To perform such checks we need to associate
attributes to nodes in the Syntax Tree and to
define rules about these attributes
You can really see this as adding tons of little
pieces of code associated with grammar rules
There would be a lot of material to cover here,
but let's just see two simple examples
Example 1: The Ada Example
ProcDecl → procedure Ident ProcBody end Ident
Whenever this rule is used, run the piece of
code
if (Ident1.lexeme != Ident2.lexeme) {
  fprintf(stderr, "Syntax error: non-matched procedure names\n");
  exit(1);
}
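
The same check as a self-contained sketch. The token representation (a list of (kind, lexeme) pairs), the SemanticError class, and the function name are all assumptions made for this illustration:

```python
class SemanticError(Exception):
    """Raised when a program is syntactically valid but breaks a semantic rule."""

def check_proc_decl(tokens):
    """Action attached to ProcDecl -> procedure Ident ProcBody end Ident.

    `tokens` is the list of (kind, lexeme) pairs matched by the rule, so the
    first Ident follows 'procedure' and the second Ident is the last token.
    """
    opening = tokens[1][1]      # lexeme of the Ident after 'procedure'
    closing = tokens[-1][1]     # lexeme of the Ident after 'end'
    if opening != closing:
        raise SemanticError(
            f"non-matched procedure names: {opening!r} and {closing!r}")

good = [("TOKEN_PROCEDURE", "procedure"), ("TOKEN_IDENT", "my_func"),
        ("TOKEN_END", "end"), ("TOKEN_IDENT", "my_func")]
bad = [("TOKEN_PROCEDURE", "procedure"), ("TOKEN_IDENT", "my_func"),
       ("TOKEN_END", "end"), ("TOKEN_IDENT", "some_other_name")]

check_proc_decl(good)           # passes silently
try:
    check_proc_decl(bad)
except SemanticError as err:
    print("error:", err)
```

Both token lists look identical to the grammar, since each name is just a TOKEN_IDENT. Only the comparison of lexemes tells them apart.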

39
Attributed Syntax Tree

Example 2: Type Checking
Say we have a language in which the body of a
function can declare variables
VarDecl → Type Ident
Each time we see this we execute the following
code
Symbol_Table.insert(Ident.lexeme, Type.lexeme);
(updates some table that keeps track of
variables and their types)
Sum → Ident = Number + Number
Each time we see this we execute the following
code
if ((Number1.type == float) ||
    (Number2.type == float)) {
  if (Symbol_Table.lookupType(Ident.lexeme)
      != float) {
    fprintf(stderr, "Syntax error: float must be assigned to float\n");
    exit(1);
  }
}
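
A compact sketch of the two actions working together. The dictionary-based symbol table, the rule that a number containing a decimal point is a float, and all function names are assumptions for this example:

```python
class TypeCheckError(Exception):
    """Raised when an assignment mixes incompatible types."""

symbol_table = {}               # maps variable name -> declared type

def on_var_decl(type_lexeme, ident_lexeme):
    """Action for VarDecl -> Type Ident: record the variable's type."""
    symbol_table[ident_lexeme] = type_lexeme

def number_type(lexeme):
    """Assumed convention: a decimal point makes the literal a float."""
    return "float" if "." in lexeme else "int"

def on_sum(ident_lexeme, number1, number2):
    """Action for Sum -> Ident = Number + Number: check the assignment."""
    if "float" in (number_type(number1), number_type(number2)):
        if symbol_table.get(ident_lexeme) != "float":
            raise TypeCheckError("float must be assigned to float")

on_var_decl("int", "count")
on_var_decl("float", "ratio")
on_sum("count", "1", "2")       # int = int + int: accepted
on_sum("ratio", "1.5", "2")     # float = float + int: accepted
try:
    on_sum("count", "1.5", "2") # int = float + int: rejected
except TypeCheckError as err:
    print("error:", err)
```

The table is filled while declarations are parsed and consulted while statements are parsed, which is the kind of memory a context-free grammar alone does not have.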

40
Conclusion

The goal here was to give you some idea of how a
parser can be built, and where these parsing
error messages you see come from
Of course it's much more complicated than what
I've let on
There are several great books on compilers that
have very detailed sections on syntactic analysis
A Classic: Compilers: Principles, Techniques,
and Tools