Provided by: OwenAst9
1
CSCI 435 Compiler Design
  • Week 3 Class 2
  • Section 2.2 From Tokens to Syntax Tree, through Section
    2.2.4.1 LL(1) Parsing
  • (pp. 110-126)
  • Ray Schneider

2
Topics of the Day
  • Tokens to Syntax Tree
  • Parsing Methods
  • Error Detection and Error Recovery
  • Top Down Parsing

3
Tokens to Syntax Tree
  • Two ways of parsing: TOP DOWN and BOTTOM UP
  • Top Down parsers can be 1) written by hand or 2) generated
    automatically
  • Bottom Up parsers 3) can only be generated
  • In all 3 cases the syntax structure is specified using
    context-free grammars

4
Importance of Grammars
  • 1) Imposes a structure on the linear sequence of
    tokens and a framework for erecting semantics on
    the nodes of the structure
  • 2) Allows automatic construction of parsers
    through the field of formal languages
  • 3) Helps create syntactically correct programs
    and provide detailed answers about syntax

5
Two Ways to do Parsing
  • the LL method: deterministic, left-to-right,
    top-down
  • the LR and LALR methods: deterministic,
    left-to-right, bottom-up
  • Left-to-right means the sequence of tokens (program text) is
    processed from left to right, one token at a time
  • Deterministic means no searching (ideal): each token processed leads
    one step closer to the final construction of the
    syntax tree, which implies LINEAR TIME
  • Deterministic methods only work for restricted classes of grammars
  • Grammars accepted by deterministic parsers are
    guaranteed to be non-ambiguous
  • Real grammars don't always cooperate, so
    transformation methods are needed to bring them
    into line.
  • Non-ambiguous means either exactly one syntax tree is
    generated, if the program is syntactically correct,
    or the program contains errors.

6
Parsing Methods
  • A parsing method constructs the syntax tree for a given sequence
    of tokens, i.e. a tree of nodes labeled with
    grammar symbols, such that:
  • Leaf nodes are labeled with terminals
  • Inner nodes are labeled with non-terminals
  • TOP NODE is labeled with the Start Symbol
  • Children of an inner node labeled N correspond to
    the members of an alternative of N, in the same
    order as they occur in that alternative
  • Terminals labeling the leaf nodes correspond to
    the sequence of tokens, in the same order as they
    occur in the input.

7
Top Down or Bottom Up, i.e. Pre-order or Post-order
  • Parsing Methods are either Top Down or
    Bottom up depending on how the nodes of the
    syntax tree are constructed

  • TREE TRAVERSAL
  • Pre-order: visit node N and then N's sub-trees
    in left-to-right order.
  • Post-order: visit N's sub-trees in left-to-right
    order and then visit node N.
  • TERMS
  • Visiting a node: doing something to the node in
    support of an algorithm that motivates the traversal.
  • Traversing a node: visiting that node and
    traversing its sub-trees in some order.
  • Traversing a tree: traversing the top node, which
    will recursively traverse the whole tree.
    Traversing belongs to the control mechanism.
8
Top Down Parser
  • construct top node
  • from top node construct children in alternative
    order
  • determine correct alternative
  • proceed down until one reaches a leftmost
    terminal
  • terminal then matches first token

9
Bottom Up Parser
  • constructs nodes in post-order
  • constructs a node only when all children have
    been constructed
  • 1st node constructed is the top of the first
    complete sub-tree it meets going left to right
    through the input

10
Error Detection and Error Recovery
  • An error is detected when the construction of the
    syntax tree fails
  • Since the tree is built by parsing methods which
    read the tokens from left to right, failure occurs
    at a SPECIFIC TOKEN. This raises two questions:
  • What error message to give to the user? and
  • Whether and how to proceed after the error?
  • ex. x = a(p+q( - b(r-s)
  • The position of detection may not reflect the position of
    the error
  • We have to do error recovery to give users some
    idea of how many errors there are. There are two strategies:
  • error correction: patch and continue
  • non-correcting error recovery: discard and
    continue with a suffix grammar. This yields reliable
    error detection but is difficult to implement and
    rarely found in parser generators.

(Figure label: point of detection)
11
Manual Creation of a Top Down Parser
  • Given non-terminal N and token t at position p in
    the input: (N, t, p)
  • The Top Down parser must decide:
  • Which alternative of N must be applied to obtain
    a sub-tree headed by N with the correct sub-tree
    at position p?
  • How do you tell that a tree is incorrect?
  • IT HAS A DIFFERENT TOKEN THAN t AS ITS LEFTMOST
    LEAF AT POSITION p!
  • So a correct tree (or reasonable approximation
    thereto) starts with t or is empty
  • The obvious implementation is a recursive Boolean
    function that tests possibilities until it finds a
    possible tree; this is called a RECURSIVE DESCENT PARSER

(Figure: input token stream t1 t2 t3 t4 t5 t6 t7 t8 t9)
12
Recursive descent parsing
  • The next figure shows a recursive descent
    RECOGNIZER: it lacks code for parse tree construction
  • The grammar is a simple arithmetic expression which
    is right associative in the '+' operator (see
    example token stream)
  • IDENTIFIER + (IDENTIFIER + IDENTIFIER) EOF
  • The parser text shows a very direct relationship
    between parser and grammar (note the utility of the
    && and || lazy Boolean operators)
  • This is one of the attractions of Recursive Descent
    Parsing
  • Each rule N corresponds with an integer routine
    that returns 1 if a terminal production of N was
    found in the present position in the input
    stream; otherwise it returns 0, i.e. no terminal
    production found.

13
The Driver
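#include <stdio.h>    /* for printf() */
#include <stdlib.h>   /* for exit() */
#include "lex.h"      /* for start_lex(), get_next_token(), Token */

int input(void);          /* parser routine for the start symbol */
int require(int found);

/* DRIVER */
int main(void) {
    start_lex(); get_next_token();
    require(input());     /* call the routine for the start symbol */
    return 0;
}

void error(void) {
    printf("Error in expression\n"); exit(1);
}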
14
Recursive Descent Parser for grammar
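/* token numbers (tokennumbers.h) */
#define EoF 256
#define IDENTIFIER 257

#include "tokennumbers.h"
#include "lex.h"      /* for get_next_token(), Token */

void error(void);     /* defined in the driver */
int expression(void);
int term(void);
int parenthesized_expression(void);
int rest_expression(void);
int token(int tk);
int require(int found);

/* PARSER */
int input(void) {
    return expression() && require(token(EoF));
}

int expression(void) {
    return term() && require(rest_expression());
}

int term(void) {
    return token(IDENTIFIER) || parenthesized_expression();
}

int parenthesized_expression(void) {
    return token('(') && require(expression()) && require(token(')'));
}

int rest_expression(void) {
    /* the "|| 1" implements the empty alternative */
    return token('+') && require(expression()) || 1;
}

/* consume the current token if it has class tk */
int token(int tk) {
    if (tk != Token.class) return 0;
    get_next_token(); return 1;
}

/* report an error unless the required construct was found */
int require(int found) {
    if (!found) error();
    return 1;
}

input → expression EOF
expression → term rest_expression
term → IDENTIFIER | parenthesized_expression
parenthesized_expression → '(' expression ')'
rest_expression → '+' expression | ε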
grammar of 2.53 and fig 2.4
15
Three Drawbacks
  • 1) Repeated backtracking over one token due to
    repeated calls to token(int tk), causing repeated
    testing of Token.class
  • 2) Operationally, the method often fails to produce a
    correct parser:
  • partial consumption of expressions causes the
    parser to be stranded (see examples pg. 119)
  • recursive descent parsers cannot handle
    left-recursive grammars, a serious disadvantage
  • 3) Error handling leaves a lot to be desired
    (laconic error handling)

16
Automatic Creation of a Top down Parser
  • Grammars that allow automatic construction of a
    top down parser are called LL(1) grammars
  • LL(1) uses a push down automaton (section
    2.2.4.4)
  • Precomputation is based on the
    observation that when a routine for N is called
    with the same token t, the same sequence of
    operations is performed with the same result, so we
    can precompute for each N what is required for
    each token t
  • We don't need to call other routines to find the
    answer: ORTHOGONALITY
  • We avoid the search overhead since only a single
    routine is called
  • Serendipitously, it also provides a solution to the
    problems on pages 119 and 120

17
LL(1) parsing
  • The final decision on success or failure is made by
    comparing the input token to the first token
    produced by the alternatives
  • So we can create FIRST sets: the sets of first
    tokens produced by all alternatives in the
    grammar, for both non-terminals N and terminals
  • FIRST(α), i.e. the FIRST set of the alternative
    α, contains all terminals α can start with; ε,
    the empty string, is included in FIRST(α) if α
    can produce ε
  • Trivial if α starts with a terminal, ex.
  • parenthesized_expression → '(' expression ')'
  • Tougher if α starts with a non-terminal N
  • Then we have to find FIRST(N), the union of the
    FIRST sets of its alternatives, which can be
    computed with a closure algorithm

18
Closure Algorithm for FIRST sets in G
  • Data Definitions
  • Token sets called FIRST sets for all terminals,
    non-terminals and alternatives of non-terminals
    in G
  • A token set called FIRST for each alternative
    tail in G; an alternative tail is a sequence of
    zero or more grammar symbols α if Aα is an
    alternative or alternative tail in G.
  • Initializations
  • For all terminals T, set FIRST(T) to {T}.
  • For all non-terminals N, set FIRST(N) to the
    empty set.
  • For all non-empty alternatives and alternative
    tails α, set FIRST(α) to the empty set.
  • Set the FIRST set of all empty alternatives and
    alternative tails to {ε}.
  • Inference rules
  • For each rule N→α in G, FIRST(N) must contain all
    tokens in FIRST(α), including ε if FIRST(α)
    contains it.
  • For each alternative or alternative tail α of the
    form Aβ, FIRST(α) must contain all tokens in
    FIRST(A), excluding ε, should FIRST(A) contain
    it.
  • For each alternative or alternative tail α of the
    form Aβ where FIRST(A) contains ε, FIRST(α) must
    contain all tokens in FIRST(β), including ε if
    FIRST(β) contains it.

figure 2.58
19
Initial FIRST sets of our example grammar
Rule / alternative (tail)            FIRST set
input                                { }
  expression EOF                     { }
  EOF                                { EOF }
expression                           { }
  term rest_expression               { }
  rest_expression                    { }
term                                 { }
  IDENTIFIER                         { IDENTIFIER }
  parenthesized_expression           { }
parenthesized_expression             { }
  '(' expression ')'                 { '(' }
  expression ')'                     { }
  ')'                                { ')' }
rest_expression                      { }
  '+' expression                     { '+' }
  expression                         { }
  ε                                  { ε }
20
Final FIRST sets
Rule / alternative (tail)            FIRST set
input                                { IDENTIFIER '(' }
  expression EOF                     { IDENTIFIER '(' }
  EOF                                { EOF }
expression                           { IDENTIFIER '(' }
  term rest_expression               { IDENTIFIER '(' }
  rest_expression                    { '+' ε }
term                                 { IDENTIFIER '(' }
  IDENTIFIER                         { IDENTIFIER }
  parenthesized_expression           { '(' }
parenthesized_expression             { '(' }
  '(' expression ')'                 { '(' }
  expression ')'                     { IDENTIFIER '(' }
  ')'                                { ')' }
rest_expression                      { '+' ε }
  '+' expression                     { '+' }
  expression                         { IDENTIFIER '(' }
  ε                                  { ε }
21
Predictive Recursive Descent Parser
  • FIRST sets are used to construct a predictive
    parser (it probably ought to be called a grammar
    directed parser since it doesn't really predict
    anything)
  • Code for each alternative is preceded by a CASE
    label based on its FIRST set
  • Testing is done on tokens only (using switch
    statements in C)
  • The routine for a grammar rule is only called when it is
    certain (if there is no syntactic error) to produce a
    terminal production

22
Predictive parser 1 (first half)
void input(void) {
    switch (Token.class) {
    case IDENTIFIER: case '(':
        expression(); token(EoF); break;
    default: error();
    }
}

void expression(void) {
    switch (Token.class) {
    case IDENTIFIER: case '(':
        term(); rest_expression(); break;
    default: error();
    }
}

void term(void) {
    switch (Token.class) {
    case IDENTIFIER: token(IDENTIFIER); break;
    case '(': parenthesized_expression(); break;
    default: error();
    }
}
first part of 2.61
23
Predictive Parser 2
void parenthesized_expression(void) {
    switch (Token.class) {
    case '(':
        token('('); expression(); token(')'); break;
    default: error();
    }
}

void rest_expression(void) {
    switch (Token.class) {
    case '+': token('+'); expression(); break;
    case EoF: case ')': break;    /* the empty alternative */
    default: error();
    }
}

void token(int tk) {
    if (tk != Token.class) error();
    get_next_token();
}
second part of 2.61
24
LL(1) parsing with nullable alternatives
  • Complication: how to handle the case label for
    the empty alternative, since it does not start
    with any token
  • Solution: when N produces an empty string we
    don't see the string, but we do see a token that
    can follow N
  • Create the FOLLOW set: the set of tokens that can
    immediately follow a given non-terminal N (see
    closure algorithm fig. 2.62)

25
LL(1) parser/grammar
  • An LL(1) parser is called LL(1) because the parser
    works from Left to Right identifying nodes in
    Leftmost derivation order, and '(1)' because all
    choices are based on one-token look ahead. A
    grammar for which this parsing works is called an
    LL(1) grammar.
  • What we've seen is a strong LL(1) grammar; there
    are still lots of things to worry about (see the list
    on page 124)

26
Things to Worry About
  • repetition operators in the grammar
  • detecting and reporting parsing conflicts (to be
    covered next time)
  • including code for the generation of the syntax
    tree
  • including code and tables for syntax error
    recovery
  • optimizations

27
Homework for Week 3
  • Objective (two weeks): get a version of lex
    running and run it on the LexByLex folder
    material, importing other files as necessary, and
    provide a "blow-by-blow" description of your
    efforts and the result (Failure Is Not An Option)
    http://csmweb2.emcweb.com/durable/2000/08/10/p19s2.htm
  • Problem to turn in next Monday:
  • 2.8 (185-186)
  • Some Flex/Lex
  • http://www.ug.bcc.bilkent.edu.tr/resat/Articles/article_1.htm
  • http://www.monmouth.com/wstreett/lex-yacc/lex-yacc.html

28
References
  • Text and figures: Modern Compiler Design