Lexical Analysis Part 1 - PowerPoint PPT Presentation

Provided by: Shmue2

1
Lexical Analysis Part 1
  • CMSC 431
  • Shon Vick

2
Lexical Analysis: What's to come
  • Programs could be made from characters, and parse
    trees would go down to the character level
  • That approach is machine specific, obfuscates
    parsing, and is cumbersome
  • Lexical analysis is a firewall between program
    representation and parsing actions
  • A prior lexical analysis phase obtains tokens
    consisting of a type (e.g. ID) and a value (the
    lexeme matched)
  • In principle: simple transition diagrams (finite
    state automata) characterize each of the things
    that can be recognized
  • In practice: a program combines the multiple
    automata definitions into an efficient state
    machine

3
Lexical Phase
  • Simple (non-recursive)
  • Efficient (special purpose code)
  • Portable (ignore character-set and architecture
    differences)
  • Use JavaCC, lex, flex, etc.
  • Used in practice with Bison/Yacc, etc.

4
Lexical Processing
  • Token: a terminal symbol in a grammar. At the
    lexical level this is a symbolic constant, and in
    print it is represented in bold
  • Pattern: the set of strings that match. For a
    keyword it is a constant. For a variable or value
    it can be represented by a regular expression
  • Lexeme: the character sequence matched by an
    instance of the token

5
Lexical Processing
  • Token attributes: a pointer to a symbol-table
    entry, which may include the lexeme, scope
    information, etc.
  • Languages may have special rules (e.g., PL/1 does
    not have reserved words and Fortran allows
    spaces in variable names; both are obscure design
    choices)

6
Lexical Analysis: sequences
  • Expression:
  • base * base - 0x4 * height * width
  • Token sequence:
  • name:base operator:times name:base operator:minus
    hexConstant:4 operator:times name:height
    operator:times name:width
  • The lexical phase returns the token and its value
    (yylval, yytext, etc.)
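
The mapping from an expression to a token sequence can be sketched in a few lines of Python. This is only an illustration, not how lex or JavaCC work internally: the token names follow the slide, while the names `TOKEN_SPEC`, `OPERATOR_NAMES`, and `tokenize`, the extra operators `plus` and `divide`, and the choice to store a hex constant as its integer value are assumptions made for this sketch.

```python
import re

# Hypothetical token patterns, chosen to mirror the slide's example.
TOKEN_SPEC = [
    ("hexConstant", r"0[xX][0-9a-fA-F]+"),
    ("name",        r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator",    r"[*\-+/]"),
    ("skip",        r"\s+"),  # white space is recognized but not returned
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))
OPERATOR_NAMES = {"*": "times", "-": "minus", "+": "plus", "/": "divide"}

def tokenize(text):
    """Return a list of (token type, value) pairs for text."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "skip":
            continue                      # ignore white space
        if kind == "operator":
            tokens.append((kind, OPERATOR_NAMES[lexeme]))
        elif kind == "hexConstant":
            tokens.append((kind, int(lexeme, 16)))
        else:
            tokens.append((kind, lexeme))
    return tokens

tokens = tokenize("base * base - 0x4 * height * width")
```

Each pair plays the role of the token type and the value that a generated scanner would hand to the parser.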

7
Tokens
  • Token attributes: a pointer to a symbol-table
    entry, which may include the lexeme, scope
    information, etc.
  • Formal specification of tokens by regular
    expressions: define alphabet, strings, languages

8
Regular Expression Notation
  • a: an ordinary letter from our alphabet
  • ε: the empty string
  • r1 | r2: choosing from r1 or r2
  • r1 r2: concatenation of r1 and r2
  • r*: zero or more times (Kleene closure)
  • r+: one or more times
  • r?: zero or one occurrence
  • [a-zA-Z]: character class (choice)
  • .: period stands for any single character except
    newline
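
As a quick check of this notation, the same operators can be tried with Python's `re` module, whose syntax happens to agree with the slide for these cases. The helper name `full_match` and the sample strings are assumptions of this sketch.

```python
import re

def full_match(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected result)
EXAMPLES = [
    (r"a",        "a",  True),   # an ordinary letter
    (r"",         "",   True),   # the empty string
    (r"a|b",      "b",  True),   # choice
    (r"ab",       "ab", True),   # concatenation
    (r"a*",       "",   True),   # zero or more times
    (r"a+",       "",   False),  # one or more needs at least one a
    (r"a?",       "aa", False),  # zero or one occurrence
    (r"[a-zA-Z]", "Q",  True),   # character class
    (r".",        "\n", False),  # any single character except newline
]

for pattern, text, expected in EXAMPLES:
    assert full_match(pattern, text) == expected
```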

9
Semantics of Regular Expressions
  • L(ε) = {ε}
  • L(a) = {a} for all a in Σ
  • L(r1 | r2) = L(r1) ∪ L(r2)
  • L(r1 r2) = {xy | x in L(r1), y in L(r2)}
  • L(R*) = {ε} ∪ {x | x in L(R)}
      ∪ {x1 x2 | x1, x2 in L(R)}
      ∪ ... ∪ {x1 ... xn | x1, ..., xn in L(R)} ∪ ...
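
These definitions can be executed directly if the language is cut off at a maximum string length, since L(R*) is otherwise infinite. The sketch below assumes a regular expression is given as a nested tuple such as `("alt", r1, r2)`; that encoding and the function name `language` are made up for illustration.

```python
def language(r, max_len):
    """Strings of L(r) whose length is at most max_len."""
    kind = r[0]
    if kind == "eps":                       # L(ε) = {ε}
        return {""}
    if kind == "sym":                       # L(a) = {a}
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":                       # union of the two languages
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":                       # every x followed by every y
        return {x + y
                for x in language(r[1], max_len)
                for y in language(r[2], max_len)
                if len(x + y) <= max_len}
    if kind == "star":                      # ε, x, x1 x2, ... up to max_len
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in base
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown node {kind!r}")

a, b = ("sym", "a"), ("sym", "b")
a_or_b_star = ("star", ("alt", a, b))       # (a | b)*
```

For example, `language(a_or_b_star, 2)` collects the empty string plus every string over a and b of length one or two.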

10
For Homework
  • Suppose Σ is {a, b}
  • What is the regular expression for:
  • All strings beginning and ending in a?
  • All strings with an odd number of a's?
  • All strings without two consecutive a's?
  • All strings with an odd number of b's followed by
    an even number of a's?
  • What's the description for a Java floating point
    number?
  • What's the description of a variable name in Java?

11
Why we care about Regular Expressions
For every regular expression, there is a
deterministic finite-state machine that defines
the same language, and vice versa
12
Regular Expressions
  • An automaton is a good visual aid,
  • but it is not suitable as a specification (its
    textual description is too clumsy)
  • However, regular expressions are a suitable
    specification:
  • a compact way to define a language that can be
    accepted by an automaton.

13
RegExp Use and Construction
  • Used as the input to a scanner generator like lex,
    flex, or JavaCC
  • define each token, and also
  • define white space, comments, etc.
  • these do not correspond to tokens, but must be
    recognized and ignored.
  • An NFA can be constructed from a RegExp via
    Thompson's Construction

14
Thompson's Construction
  • There are building blocks for each regular
    expression operator
  • More complex RegExps are constructed by composing
    smaller building blocks
  • Assumes that the NFAs at each step of the
    construction will have a single accepting state

15
Regular Expressions to NFA (1)
  • For each kind of rexp, define an NFA
  • Notation: NFA for rexp M
  • For ε
  • For input a

16
Regular Expressions to NFA (2)
  • For A B (concatenation)
  • For A | B (alternation)

17
Regular Expressions to NFA (3)
  • For A*
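
The building blocks for a single symbol, concatenation, alternation, and Kleene closure can be sketched as small Python functions. This is one common way to write Thompson's construction, not necessarily the exact diagrams from the slides; the class `NFA`, the `EPS` marker, and the function names are assumptions of the sketch, and concatenation is done here by adding an ε edge rather than merging states.

```python
import itertools

EPS = None                    # label for an ε-transition (sketch convention)
_fresh = itertools.count()    # supplies new state numbers

class NFA:
    """An NFA fragment with one start and one accepting state."""
    def __init__(self, start, accept, edges):
        self.start = start
        self.accept = accept
        self.edges = edges    # list of (source, label, target)

def symbol(a):
    """NFA for a single input symbol a (or for ε when a is EPS)."""
    s, f = next(_fresh), next(_fresh)
    return NFA(s, f, [(s, a, f)])

def concat(m, n):
    """NFA for A B: link A's accepting state to B's start with ε."""
    return NFA(m.start, n.accept,
               m.edges + n.edges + [(m.accept, EPS, n.start)])

def union(m, n):
    """NFA for A | B: new start and accepting states around both."""
    s, f = next(_fresh), next(_fresh)
    return NFA(s, f, m.edges + n.edges + [
        (s, EPS, m.start), (s, EPS, n.start),
        (m.accept, EPS, f), (n.accept, EPS, f)])

def star(m):
    """NFA for A*: allow skipping A or repeating it."""
    s, f = next(_fresh), next(_fresh)
    return NFA(s, f, m.edges + [
        (s, EPS, m.start), (s, EPS, f),
        (m.accept, EPS, m.start), (m.accept, EPS, f)])

def states(nfa):
    """All states mentioned by the NFA's edges."""
    return {q for (src, _, dst) in nfa.edges for q in (src, dst)}

# (1 | 0)*1, built bottom-up from its sub-expressions
example = concat(star(union(symbol("1"), symbol("0"))), symbol("1"))
```

Every function returns a fragment with exactly one accepting state, which is the property the construction relies on when fragments are composed.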

18
Others
  • What would be the representation for A+ ?
  • What would be the representation for A? ?
  • What about for [a-z] ?

19
Example of RegExp -> NFA conversion
  • Consider the regular expression
  • (1 | 0)*1
  • The NFA is

[Figure: NFA with states A through J; edges labeled 1, 0, and ε]

20
More Homework Problems
  • What is the NFA for the following RE?
  • (a(bc)) a
  • What is the NFA for the following RE?
  • ((ab)c) (a b c)

21
Lexical Analyzer
  • Can be programmed in a high-level language.
  • Can be generated using tools like LEX/Flex
  • Integrate these tools with C/C++ or Java code
  • In Java there are other tools, JFlex for example

22
How can a tool like LEX or JavaCC work?
  • Translate regular expressions to
    Non-deterministic Finite Automata (NFA)
  • Easier expressive form than the DFA
  • Automata theory tells us how to optimize
  • Run the automata:
  • Simulate the NFA, or
  • Translate the NFA to a DFA: a new DFA where each
    state corresponds to a set of NFA states (see
    pages 28-29 of Appel for the set construction)
  • Have the DFA move between states in simulation of
    the NFA's states

23
Non-deterministic FA
  • An NFA is an FA modified to allow zero, one, or
    MORE transitions from a state on the same input
    symbol
  • Easier to express complex patterns as an NFA
  • Harder to mechanically simulate an NFA: which
    transition do we make on an input? (simulate all
    of them, then confirm it worked)
  • DFAs and NFAs are functionally equivalent.
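
"Simulate all of them" can be made concrete by tracking the set of states the NFA could be in after each input symbol. The sketch below covers an NFA without null moves; the function name `nfa_accepts` and the example machine (strings over a and b that end in "ab") are assumptions chosen for illustration.

```python
def nfa_accepts(delta, start, accepting, text):
    """Simulate an NFA by following every possible transition at once.

    delta maps (state, symbol) to the set of possible next states;
    a missing entry means there is no transition.
    """
    current = {start}
    for ch in text:
        current = {t for s in current for t in delta.get((s, ch), ())}
    # Accept when at least one of the possible runs ends in an accepting state.
    return bool(current & accepting)

# Hypothetical NFA: state 0 has two transitions on 'a' (stay, or guess
# that the final "ab" has started).
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
ACCEPTING = {2}
```

The set `current` never holds more states than the NFA has, so each input symbol costs work proportional to the size of the NFA.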

24
NFA with null moves
  • The NFA model can be extended to include
    transitions on <null> input.
  • Such a transition changes the state without
    reading any symbol from the input stream.
  • ε-closure(q): the set of all states reachable from
    q without reading any input symbol (following the
    null edges)

25
εClosure Operator
  • The εClosure operator is defined as εClosure(s) =
    {s} ∪ {states reachable from s using ε
    transitions}.
  • Example: εClosure(1) = {1, 3}

[Figure: NFA with start state 1 and states 1 through 5; edges labeled a, b, a/b, and ε]

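
The closure can be computed with a simple worklist. In this sketch the function name `eps_closure` and the dictionary layout are assumptions; the only edge taken from the slide's example is the null edge from state 1 to state 3, and the second dictionary is a made-up chain used to show that the closure follows several null edges in a row.

```python
def eps_closure(states, eps_edges):
    """Return the given states plus everything reachable by ε edges only.

    eps_edges maps a state to the set of states one ε edge away.
    """
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps_edges.get(s, ()):
            if t not in closure:      # visit each state once
                closure.add(t)
                stack.append(t)
    return closure

SLIDE_EDGES = {1: {3}}                # assumed: the ε edge 1 -> 3
CHAIN_EDGES = {1: {2}, 2: {3}, 3: {1}}  # hypothetical ε cycle

slide_result = eps_closure({1}, SLIDE_EDGES)
```

Checking `t not in closure` before pushing is what keeps the loop from running forever on a cycle of null edges.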
26
RE to FA
  • If we write an expression as an RE (easy for
    people), how do we turn it into an FA (something
    a machine can simulate)?
  • Use Thompson's Construction
  • At most twice as many states as there are symbols
    and operators in the regular expression.
  • Results in an NFA (needs a non-deterministic
    computer to run most efficiently, hmm.)

27
NFA to DFA
  • Build super states in a DFA, where each super
    state represents the set of states that the NFA
    could reach from a state on a symbol
  • ε-closure(q): states that can be arrived at from
    q with just null transitions
  • move(S, a): states that can be reached from states
    in S on scanning a symbol a (from the input)
  • ε-closure(S): states that can be reached with ε
    transitions from states in S

28
NFA to DFA (cont.)
  • Subset Construction (alg 3.2)
  • Find ε-closure(q0) and add it to FAStates, unmarked
  • while (some S in FAStates is unmarked)
    • mark S
    • for each a in alphabet
      • T = ε-closure(move(S, a))
      • if (T ∉ FAStates)
        • FAStates.include(T), unmarked
      • FATran[S, a] = T
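
The loop above translates almost line for line into Python. This is a sketch under assumptions: the NFA is given as a dictionary from (state, label) to a set of states, `EPS` marks a null edge, and the small NFA for a*b at the end is hypothetical, chosen only to keep the result easy to inspect.

```python
EPS = None  # label for ε-transitions (sketch convention)

def eps_closure(states, delta):
    """States reachable from `states` using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def move(states, a, delta):
    """States reachable from `states` by scanning the symbol a."""
    return {t for s in states for t in delta.get((s, a), ())}

def subset_construction(start, delta, alphabet):
    q0 = frozenset(eps_closure({start}, delta))
    fa_states = {q0}        # all DFA states found so far
    unmarked = [q0]         # DFA states not yet processed
    fa_tran = {}
    while unmarked:
        S = unmarked.pop()  # "mark" S by taking it off the worklist
        for a in alphabet:
            T = frozenset(eps_closure(move(S, a, delta), delta))
            if T not in fa_states:
                fa_states.add(T)
                unmarked.append(T)
            fa_tran[S, a] = T
    return q0, fa_states, fa_tran

# Hypothetical NFA for a*b: 0 -a-> 0, 0 -ε-> 1, 1 -b-> 2 (state 2 accepts)
delta = {(0, "a"): {0}, (0, EPS): {1}, (1, "b"): {2}}
q0, fa_states, fa_tran = subset_construction(0, delta, "ab")
```

A DFA state is accepting when its set contains an accepting NFA state. The empty set shows up as an ordinary DFA state and acts as the dead state.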

29
DFA vs. NFA
  • NFA is smaller, O(|r|) space, but takes more time
    for simulation, O(|r| × |x|) time, even with the
    nice properties of Thompson's construction
  • DFA is faster, O(|x|) time, but is not space
    efficient, O(2^|r|) space
  • Here r is the regular expression and x is the
    input string
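
The O(|x|) running time of a DFA comes from doing one table lookup per input character, with no sets of states to maintain. A minimal sketch, assuming a hand-built two-state DFA for (1 | 0)*1; the names `dfa_accepts` and `TABLE` are made up for this example.

```python
def dfa_accepts(table, start, accepting, text):
    """Run a DFA: exactly one table lookup per input character."""
    state = start
    for ch in text:
        state = table.get((state, ch))
        if state is None:        # no transition defined: reject
            return False
    return state in accepting

# Hypothetical DFA for (1 | 0)*1.
# State 0: the input so far does not end in 1. State 1: it ends in 1.
TABLE = {
    (0, "0"): 0, (0, "1"): 1,
    (1, "0"): 0, (1, "1"): 1,
}
ACCEPTING = {1}
```

The space cost is in the table: a DFA produced by the subset construction can need one row for every subset of NFA states.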

30
NFA to DFA
  • What is the difference between the two?
  • Is there a single DFA for a corresponding NFA?
  • Why do we want to do this anyway?

31
Subset Construction for NFA -> DFA
  • Compute A = εClosure(start)
  • Compute the set of states reachable from A on
    transition a; call this new set S
  • Compute εClosure(S); this is the new state, so
    label it with the next available label
  • Continue for all possible transitions from the
    current state, for all applicable elements of S
  • Repeat steps 2-4 for each new state

32
Example a cb

[Figure: NFA with states 1 through 6; edges labeled a, b, c, and ε]

33
References
  • Compilers: Principles, Techniques, and Tools, Aho,
    Sethi, Ullman, Chapter 3
  • http://www.cs.columbia.edu/lerner/CS4115
  • Modern Compiler Implementation in Java, Andrew
    Appel, Cambridge University Press, 2003