Lexical Analysis and Parsing - PowerPoint PPT Presentation

Slides: 52
Provided by: DalT1


1
Lexical Analysis and Parsing
2
Regular Expressions
  • Later definitions build on earlier ones
  • Nothing defined in terms of itself (no recursion)

Regular grammar for numeric literals in Pascal:
digit -> 0 | 1 | 2 | ... | 9
unsigned_integer -> digit digit*
unsigned_number -> unsigned_integer
    ( ( . unsigned_integer ) | ε )
    ( ( e ( + | - | ε ) unsigned_integer ) | ε )
3
Regular Expression Notation
  • a           an ordinary letter
  • ε           the empty string
  • M | N       choosing from M or N
  • MN          concatenation of M and N
  • M*          zero or more times (Kleene star)
  • M+          one or more times
  • M?          zero or one occurrence
  • [a-zA-Z]    character set alternation (choice)
  • .           period stands for any single character except newline
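The notation on this slide carries over almost unchanged to Python's `re` module, so it can be tried out directly. The sketch below is illustrative only: the sample patterns and strings and the helper name `matches` are assumptions, and the last pattern is one possible rendering of the Pascal unsigned_number definition from the earlier slide.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """True when the whole of `text` belongs to the language of `pattern`."""
    return re.fullmatch(pattern, text) is not None

# One (pattern, sample string) pair per row of the notation table.
EXAMPLES = {
    "M | N": (r"cat|dog", "dog"),
    "MN": (r"ab", "ab"),
    "M*": (r"a*", ""),
    "M+": (r"a+", "aaa"),
    "M?": (r"colou?r", "color"),
    "[a-zA-Z]": (r"[a-zA-Z]", "Q"),
    ".": (r".", "x"),
}

# Pascal-style unsigned_number: digits, optional fraction, optional exponent.
UNSIGNED_NUMBER = r"[0-9]+(\.[0-9]+)?(e[+-]?[0-9]+)?"

for form, (pattern, sample) in EXAMPLES.items():
    print(f"{form:10} {pattern!r:14} {sample!r:8} {matches(pattern, sample)}")

for literal in ("42", "3.14", "6.02e23", "1e-9", "3.", ".5"):
    print(f"{literal:8} {matches(UNSIGNED_NUMBER, literal)}")
```

`re.fullmatch` is used rather than `re.match` so that the whole string must be in the language, which is the sense in which a regular expression defines a token.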

4
Converting a Regular Expression to an NFA
(Figure: NFA fragments built for each regular-expression form, with ε-transitions joining the sub-automata for M and N.)
5
Converting an NFA to a DFA
  • For a set of states S, closure(S) is the set of
    states that can be reached from S without
    consuming any input.
  • For a set of states S, DFAedge(S, c) is the set
    of states that can be reached from S by consuming
    input symbol c.
  • Each set of NFA states corresponds to one DFA
    state (hence at most 2^n states for an NFA with
    n states).

6
Extended BNF (EBNF)
  • Rules or productions
  • Variables or non-terminals on LHS
  • Terminals (the programming language's tokens)
  • Start symbol (non-terminal)
  • Vertical bar
  • Kleene star
  • Meta-level parentheses of regular expressions

7
Derivations and Parse Trees
Nested constructs require recursion, i.e. context-free grammars.
CFG for arithmetic expressions:
expression -> identifier | number | - expression
    | ( expression ) | expression operator expression
operator -> + | - | * | /
8
Parse Tree for slope * x + intercept
Is this the only parse tree for this expression
and grammar?
9
A Better Expression Grammar
1. expression -> term | expression add_op term
2. term -> factor | term mult_op factor
3. factor -> identifier | number | - factor | ( expression )
4. add_op -> + | -
5. mult_op -> * | /
A good grammar reflects the internal structure of programs. This grammar is
unambiguous and captures (HOW?)
  • operator precedence (*, / bind tighter than +, -)
  • associativity (ops group left to right)
10
And Better Parse Trees...
3 + 4 * 5
10 - 4 - 3
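How the layered grammar yields precedence and left-to-right grouping can be seen by building the trees for these two expressions. The sketch below is a hand-written parser for the expression/term/factor grammar under some assumptions: the left-recursive rules are realised as loops (a common substitution, not a literal transcription of the productions), only numbers are supported as operands, and every function name is made up for the example.

```python
import operator
import re

TOKEN = re.compile(r"\s*(\d+|[-+*/()])")
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(text):
    """Return a parse tree as nested tuples (op, left, right)."""
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expression():                 # expression -> term | expression add_op term
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())  # tree so far becomes the LEFT operand
        return node

    def term():                       # term -> factor | term mult_op factor
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():                     # factor -> number | - factor | ( expression )
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        tok = advance()
        if tok == "-":
            return ("neg", factor())
        if tok == "(":
            node = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

def evaluate(node):
    if isinstance(node, int):
        return node
    if node[0] == "neg":
        return -evaluate(node[1])
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))

for source in ("3 + 4 * 5", "10 - 4 - 3"):
    tree = parse(source)
    print(source, "=>", tree, "=", evaluate(tree))
```

Because mult_op is handled one level below add_op, 4 * 5 becomes a subtree of the addition; because each loop folds the tree built so far into the left operand, 10 - 4 - 3 groups as (10 - 4) - 3.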
11
Syntax-directed Translation
  • Parser calls scanner to obtain tokens.
  • Assembles tokens into parse tree.
  • Passes tree to later phases of compilation.
  • Scanner: deterministic finite automaton.
  • Parser: pushdown automaton.
  • Scanners and parsers can be generated
    automatically from regular expressions and CFGs
    (e.g. lex/yacc, see assignment 1).

12
Deeper Into the Details of Scanning.
13
Scanning
  • Accept the longest possible token in each
    invocation of the scanner.
  • Implementation:
      • Ad hoc.
      • Capture finite automaton:
          • Case (switch) statements.
          • Table and driver.
  • Impurities:
      • Handling of keywords (look up identifier in hash
        table).
      • Need to peek (look) ahead (e.g. . vs ..).
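The points on this slide (longest match, keyword lookup, and peeking ahead to tell . from ..) can be illustrated with a small ad hoc scanner. This is a sketch under stated assumptions: the token kinds, the four-word keyword set, and the function name `scan` are invented for the example and do not come from the Pascal scanner on the following slides.

```python
# Keywords live in a hash-based set; every identifier is looked up after scanning.
KEYWORDS = {"begin", "end", "if", "then"}  # small illustrative subset

def scan(text):
    """Return a list of (kind, lexeme) pairs for `text`."""
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch.isalpha():
            start = i
            while i < len(text) and text[i].isalnum():  # longest possible identifier
                i += 1
            word = text[start:i]
            kind = "keyword" if word.lower() in KEYWORDS else "ident"
            tokens.append((kind, word))
        elif ch.isdigit():
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            # Peek two characters ahead: '.' continues the number only when
            # a digit follows, so that 3..5 scans as 3, .., 5.
            if i + 1 < len(text) and text[i] == "." and text[i + 1].isdigit():
                i += 1
                while i < len(text) and text[i].isdigit():
                    i += 1
            tokens.append(("number", text[start:i]))
        elif ch == ".":
            if text[i:i + 2] == "..":   # longest match: prefer .. over .
                tokens.append(("dotdot", ".."))
                i += 2
            else:
                tokens.append(("dot", "."))
                i += 1
        else:
            tokens.append(("symbol", ch))
            i += 1
    return tokens

print(scan("begin x end."))
print(scan("3..5"))
print(scan("3.14"))
```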

14
Scanner for Pascal
15
Scanner for Pascal (case Statements)
16
Scanner (Table and Driver)
17
Scanner Generators
  • Start with a regular expression.
  • Construct an NFA from it.
  • Use a set of subsets construction to obtain an
    equivalent DFA.
  • Construct the minimal equivalent DFA.

18
Example of scanner generation
Language: strings of 0s and 1s in which the number of 0s is even.
Regular expression: (1*01*0)*1*
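For this language the minimal DFA needs only two states, one for "even number of 0s so far" and one for "odd". The sketch below writes that DFA in table-and-driver style and cross-checks it against the regular expression, taken here to be (1*01*0)*1* in Python syntax. The state names and the helper `accepts` are assumptions of the example.

```python
import itertools
import re

EVEN, ODD = 0, 1
# Transition table: TABLE[state][symbol] -> next state.
TABLE = {
    EVEN: {"0": ODD, "1": EVEN},
    ODD: {"0": EVEN, "1": ODD},
}
START, ACCEPTING = EVEN, {EVEN}

def accepts(text):
    """Generic driver: follow the table one input symbol at a time."""
    state = START
    for ch in text:
        row = TABLE[state]
        if ch not in row:
            return False
        state = row[ch]
    return state in ACCEPTING

EVEN_ZEROS = re.compile(r"(1*01*0)*1*")

# Compare the DFA and the regular expression on every string up to length 8.
for n in range(9):
    for letters in itertools.product("01", repeat=n):
        s = "".join(letters)
        assert accepts(s) == (EVEN_ZEROS.fullmatch(s) is not None)
print("DFA and regular expression agree on all strings up to length 8")
```

The driver loop never changes; a different language only needs a different table, which is what a scanner generator produces.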
19
NFA -> DFA conversion
hardcopy
  • A state of the DFA after reading a given input
    letter represents the set of states that the NFA
    might have reached with the same input letter.
  • Each state of the DFA that contains a final state
    of the NFA is a final state of the DFA.
  • The number of states of the DFA is exponential (in
    the worst case) in the number of states of the NFA.

20
Obtaining the minimal equivalent DFA
  • Initially two equivalence classes: final and
    nonfinal states.
  • Search for an equivalence class C and an input
    letter a such that with a as input, the states in
    C make transitions to states in k > 1 different
    equivalence classes.
  • Partition C into k classes accordingly.
  • Repeat until unable to find a class to partition.
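The refinement loop described in these bullets can be sketched as follows. The four-state DFA is hypothetical (it accepts strings over {a, b} that end in ab and contains one redundant state), and the function and variable names are assumptions of the example; the transition function is assumed to be total.

```python
def minimize(states, alphabet, delta, finals):
    """Return the equivalence classes of a DFA with total transition table `delta`."""
    # Initially two classes: final and nonfinal states.
    partition = [c for c in (set(finals), set(states) - set(finals)) if c]
    changed = True
    while changed:
        changed = False
        for cls in partition:
            for a in alphabet:
                # Group the states of cls by the class their a-transition lands in.
                groups = {}
                for s in cls:
                    target = next(i for i, c in enumerate(partition)
                                  if delta[s][a] in c)
                    groups.setdefault(target, set()).add(s)
                if len(groups) > 1:          # k > 1: partition cls into k classes
                    partition.remove(cls)
                    partition.extend(groups.values())
                    changed = True
                    break
            if changed:
                break                        # restart the search on the new partition
    return {frozenset(c) for c in partition}

# Hypothetical DFA; state 3 behaves exactly like state 0.
STATES = {0, 1, 2, 3}
DELTA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
FINALS = {2}

classes = minimize(STATES, "ab", DELTA, FINALS)
print(sorted(sorted(c) for c in classes))
```

Here the nonfinal class {0, 1, 3} is split on input b, because state 1 moves to the final class while 0 and 3 do not; states 0 and 3 stay together and become a single state of the minimal DFA.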

21
Example of obtaining the minimal equivalent DFA
Initial classes: {A, B, E}, {C, D}. No class
requires partitioning! Hence a two-state DFA is
obtained.
22
Example (cont.)
23
Deeper Into the Details of Parsing
24
Parsing approaches (for linear time performance)
  • Parsing in general has O(n^3) cost.
  • Need classes of grammars that can be parsed in
    linear time.
  • Top-down or predictive parsing or recursive
    descent parsing or LL parsing (Left-to-right,
    Left-most)
  • Bottom-up or shift-reduce parsing or LR parsing
    (Left-to-right, Right-most)

25
Top-down Parsing
  • Predicts a derivation
  • Matches non-terminal against token observed in
    input

26
LL(1) Grammar
  • A grammar for which a top-down deterministic parser
    can be produced with one token of look-ahead.
  • How can one tell whether a grammar is LL(1)?
  • Define s-grammar first
  • Then generalize to LL(1) grammar

27
S-grammar
  • The RHS of each production begins with a
    terminal.
  • Where a nonterminal appears as the LHS of more
    than one production, the corresponding RHSs begin
    with different terminals.

28
Examples
  • An s-grammar:
      • S -> pX
      • S -> qY
      • X -> aXb
      • X -> x
      • Y -> aYd
      • Y -> y
  • Not an s-grammar:
      • S -> R
      • S -> T
      • R -> pX
      • T -> qY
      • X -> aXb
      • X -> x
      • Y -> aYd
      • Y -> y
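The two s-grammar conditions are mechanical enough to check in a few lines. In the sketch below each production is written as an (LHS, RHS) pair of strings, upper-case letters are taken to be nonterminals and lower-case letters terminals; that encoding and the function name are assumptions of the example.

```python
def is_s_grammar(productions):
    """Check both s-grammar conditions for a list of (lhs, rhs) productions."""
    seen = set()
    for lhs, rhs in productions:
        # Condition 1: every RHS begins with a terminal.
        if not rhs or not rhs[0].islower():
            return False
        # Condition 2: alternatives for one LHS begin with different terminals.
        if (lhs, rhs[0]) in seen:
            return False
        seen.add((lhs, rhs[0]))
    return True

G1 = [("S", "pX"), ("S", "qY"), ("X", "aXb"), ("X", "x"),
      ("Y", "aYd"), ("Y", "y")]
G2 = [("S", "R"), ("S", "T"), ("R", "pX"), ("T", "qY"),
      ("X", "aXb"), ("X", "x"), ("Y", "aYd"), ("Y", "y")]

print(is_s_grammar(G1))
print(is_s_grammar(G2))
```

The second grammar fails on its first production, S -> R, whose right-hand side begins with a nonterminal, even though it generates the same language as the first.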

29
Example of Left-to-Right Leftmost Derivation
Input: paaaxbbb
  • Derivation: S
  • => pX
  • => paXb
  • => paaXbb
  • => paaaXbbb
  • => paaaxbbb

Where the leftmost non-terminal can be replaced
using more than one production, the appropriate
production can be chosen by examining the next
symbol of the input.
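Choosing each production from the next input symbol can be sketched as a table lookup driven by a stack. The table below encodes the s-grammar of the previous slide, keyed by (nonterminal, next input symbol); the encoding, the function name, and the convention that upper-case letters are nonterminals are assumptions of the example.

```python
# (nonterminal, next input symbol) -> right-hand side to predict.
TABLE = {
    ("S", "p"): "pX", ("S", "q"): "qY",
    ("X", "a"): "aXb", ("X", "x"): "x",
    ("Y", "a"): "aYd", ("Y", "y"): "y",
}

def leftmost_derivation(text, start="S"):
    """Return the sentential forms of the leftmost derivation of `text`."""
    forms = [start]
    matched = ""        # terminals already matched against the input
    stack = [start]     # rest of the sentential form; leftmost symbol is stack[-1]
    pos = 0
    while stack:
        top = stack.pop()
        look = text[pos] if pos < len(text) else None
        if top.isupper():
            rhs = TABLE.get((top, look))
            if rhs is None:
                raise SyntaxError(f"no production for {top} on {look!r}")
            stack.extend(reversed(rhs))   # predict: replace the nonterminal by its RHS
            forms.append(matched + "".join(reversed(stack)))
        elif top == look:
            matched += top                # match a terminal and advance the input
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {look!r}")
    if pos != len(text):
        raise SyntaxError("input remains after the derivation is complete")
    return forms

print(" => ".join(leftmost_derivation("paaaxbbb")))
```

Each input symbol is inspected once and no choice is ever undone, which is where the linear running time of this style of parsing comes from.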
30
Starter Symbols
  • A terminal a is a starter symbol for nonterminal
    A iff A =>* aα, where α is a string of terminals
    and/or nonterminals. S(A): set of starter symbols
    for A.
  • A terminal a is a starter symbol for α iff α =>* aβ,
    where α, β are strings of terminals and/or
    nonterminals.

31
Director Symbols
  • The director symbols of nonterminal A are:
      • S(A), and
      • If A can generate the empty string, then all the
        symbols which can follow A.
32
LL(1) Grammar Definition
  • For every nonterminal appearing in the LHS of
    more than one production, the sets of director
    symbols corresponding to the RHSs of the
    alternative productions are disjoint.
  • All LL(1) grammars can be parsed
    deterministically top down.
  • An algorithm exists for automatically determining
    whether a grammar is LL(1).
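One way such an algorithm can work is to compute starter and follower sets by iterating to a fixed point, form the director set of each production, and test the alternatives of every nonterminal for overlap. The sketch below does this for two tiny hypothetical grammars; the grammar encoding (a dict from nonterminal to a list of RHS tuples, with () for the empty string), the end marker "$", and all function names are assumptions of the example.

```python
from itertools import combinations

END = "$"  # end-of-input marker (an assumption of this sketch)

def director_sets(grammar, start):
    """Map each (nonterminal, rhs) production to its set of director symbols."""
    nullable = set()
    starters = {A: set() for A in grammar}
    followers = {A: set() for A in grammar}
    followers[start].add(END)

    def starters_of(rhs):
        """Starter symbols of a string, and whether it can derive the empty string."""
        out = set()
        for X in rhs:
            if X not in grammar:          # terminal
                out.add(X)
                return out, False
            out |= starters[X]
            if X not in nullable:
                return out, False
        return out, True

    changed = True
    while changed:                         # iterate until no set grows
        changed = False
        for A, alternatives in grammar.items():
            for rhs in alternatives:
                first, empty = starters_of(rhs)
                before = (len(nullable), len(starters[A]))
                if empty:
                    nullable.add(A)
                starters[A] |= first
                if before != (len(nullable), len(starters[A])):
                    changed = True
                for i, X in enumerate(rhs):
                    if X in grammar:       # what can follow nonterminal X here?
                        tail, tail_empty = starters_of(rhs[i + 1:])
                        new = tail | (followers[A] if tail_empty else set())
                        if not new <= followers[X]:
                            followers[X] |= new
                            changed = True

    directors = {}
    for A, alternatives in grammar.items():
        for rhs in alternatives:
            first, empty = starters_of(rhs)
            directors[(A, rhs)] = first | (followers[A] if empty else set())
    return directors

def is_ll1(grammar, start):
    """True when alternative productions always have disjoint director sets."""
    directors = director_sets(grammar, start)
    return not any(directors[(A, r1)] & directors[(A, r2)]
                   for A, alts in grammar.items()
                   for r1, r2 in combinations(alts, 2))

G_OK = {"S": [("A", "b")], "A": [("a", "A"), ()]}    # S -> A b;  A -> a A | ε
G_BAD = {"S": [("A", "a")], "A": [("a", "A"), ()]}   # S -> A a;  A -> a A | ε

print(director_sets(G_OK, "S"))
print(is_ll1(G_OK, "S"), is_ll1(G_BAD, "S"))
```

In the second grammar the symbol a both starts A -> a A and can follow A, so it is a director symbol of both alternatives and one token of look-ahead cannot choose between them.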

33
Example
(hardcopy)
T -> AB   A -> PQ   A -> BC   P -> pP   P -> ε   Q -> qQ   Q -> ε   B -> bB   B -> ε
C -> cC   C -> f
Director symbols of A (from A -> PQ): p, q, b, ε
Director symbols of A (from A -> BC): b, ε
34
LL(1) Languages
  • Do all languages possess an LL(1) grammar? (No).
  • If not, is there an algorithm to determine
    whether a language is LL(1)? (No).
  • The obvious grammar for most programming
    languages is not LL(1).
  • Given a non-LL(1) grammar that describes an LL(1)
    language, can it be transformed into LL(1) form?
    (Yes in special cases useful in practice).

35
Bottom-up Parsing
  • LR: left-to-right, right-most derivation
    (bottom-up parser)
  • Shifts new leaves from scanner into a forest of
    partially completed parse tree fragments
  • At some point it realizes that it has a complete
    right-hand side, which it can reduce.
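The shift and reduce steps can be traced with a deliberately naive sketch. It assumes a hypothetical left-recursive grammar, list -> list , id | id, and reduces greedily whenever the top of the stack equals some right-hand side (longest first). That shortcut happens to work for this tiny grammar; a real LR parser consults a state machine and look-ahead to decide when a reduction is correct.

```python
# Hypothetical grammar, longest right-hand side listed first.
PRODUCTIONS = [
    ("list", ("list", ",", "id")),
    ("list", ("id",)),
]

def shift_reduce(tokens):
    """Return (accepted, trace) where trace records each action and the stack after it."""
    stack, trace = [], []
    for tok in tokens:
        stack.append(tok)                       # shift the next leaf
        trace.append(("shift", tuple(stack)))
        reduced = True
        while reduced:                          # reduce while a full RHS is on top
            reduced = False
            for lhs, rhs in PRODUCTIONS:
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    trace.append((f"reduce {lhs} -> {' '.join(rhs)}", tuple(stack)))
                    reduced = True
                    break
    return stack == ["list"], trace

ok, trace = shift_reduce(["id", ",", "id", ",", "id"])
for action, stack in trace:
    print(f"{action:28} {' '.join(stack)}")
print("accepted:", ok)
```

The stack holds the roots of the partially completed tree fragments; with left recursion it never grows beyond three symbols for this grammar, however long the list is.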

36
Bottom-up parsing (2)
  • The symbols joined together are called a handle
  • Keep track of the productions we might be in the
    middle of.
  • Characteristic Finite State Machine (CFSM) in LR
    parsing: its states are the various possible sets
    of productions.
  • The CFSM recognizes the grammar's viable prefixes.

37
A Simple Grammar for a Comma-separated List of
Identifiers
hardcopy
id_list -> id id_list_tail
id_list_tail -> , id id_list_tail
id_list_tail ->
_________________________
String to be parsed: A, B, C
38
Top-down/bottom-up Parsing
39
Stack Contents (Roots of Partial Trees) in
Bottom-up Parsing
40
A More Realistic Example: the Calculator Language
41
A Sum-and-average Program
hardcopy
read A
read B
sum := A + B
write sum
write sum / 2
42
LL(1) Grammar for Calculator Language
hardcopy
43
Recursive Descent Parser
hardcopy
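The parser on this slide was distributed as a handout and is not part of the transcript, so the following is only a sketch of what a recursive descent parser for a calculator language of this kind might look like. It assumes three statement forms (read id, write expr, id := expr), the usual expr/term/factor layering, and loops in place of the tail nonterminals of an LL(1) grammar; it also evaluates while it parses. All names are invented for the example.

```python
import re

TOKEN = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(:=|[-+*/()]))")

def tokenize(text):
    tokens, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, name, symbol = m.groups()
        if number:
            tokens.append(("number", int(number)))
        elif name in ("read", "write"):
            tokens.append((name, name))        # keywords get their own token kind
        elif name:
            tokens.append(("id", name))
        else:
            tokens.append((symbol, symbol))
        pos = m.end()
    tokens.append(("$$", None))                # end-of-input marker
    return tokens

def run(source, inputs):
    """Parse and execute `source`; return the list of written values."""
    tokens, pos = tokenize(source), 0
    inputs, env, output = iter(inputs), {}, []

    def peek():
        return tokens[pos][0]

    def match(kind):
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()}")
        pos += 1
        return tokens[pos - 1][1]

    def stmt_list():                           # one routine per nonterminal
        while peek() in ("id", "read", "write"):
            stmt()

    def stmt():
        if peek() == "id":
            name = match("id")
            match(":=")
            env[name] = expr()
        elif peek() == "read":
            match("read")
            name = match("id")
            env[name] = next(inputs)
        else:
            match("write")
            output.append(expr())

    def expr():
        value = term()
        while peek() in ("+", "-"):
            op = match(peek())
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term():
        value = factor()
        while peek() in ("*", "/"):
            op = match(peek())
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor():
        if peek() == "(":
            match("(")
            value = expr()
            match(")")
            return value
        if peek() == "id":
            return env[match("id")]
        return match("number")

    stmt_list()
    match("$$")
    return output

PROGRAM = "read A read B sum := A + B write sum write sum / 2"
print(run(PROGRAM, [3, 5]))
```

Each routine decides what to do by looking only at the next token, which is the one-token look-ahead that the LL(1) property guarantees is sufficient.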
44
Parse Tree for Sum-and-avg Program
45
LR(1) Grammar for the Calculator Language
hardcopy
46
LR(1) Grammar for the Calculator Language
  • LR(1) version uses:
      • Left recursion for stmt_list
      • Left recursion for expr and term
  • Key concept:
      • Figure out when you reach the end of a RHS.
      • Keep track of the set of productions we may be in
        the middle of, and where in these productions.

47
Bottom-up Parsing Overview
  • LR(1) parsers loop over inspection of a look-up
    table to find out what action to take.
  • Variants differ in how to resolve conflicts.
  • Most common is LALR(1)

48
A bird's-eye view of grammar and language classes
49
Relationships Among Grammar Classes
LALR(1) is a standard for programming languages
and automatic parser generators
50
Relationships Among Language Classes
51
Examples of Languages (Proofs Beyond the Scope of
This Class)