CS-338 Compiler Design

Title: Introduction to Compiler Construction. Author: Robert van Engelen. Last modified by: noman. Created: 1/5/2005 12:36:11 AM.

1
CS-338 Compiler Design
  • Dr. Syed Noman Hasany
  • Assistant Professor
  • College of Computer, Qassim University

2
Chapter 3: Lexical Analyzer
  • The role of the lexical analyzer
    • It is the first phase of the compiler.
    • It reads the input characters and produces as
      output a sequence of tokens that the parser uses
      for syntax analysis.
    • It strips comments and white space (blank, tab
      and newline characters) from the source program.
    • It also correlates error messages from the
      compiler with the source program (because it
      keeps track of line numbers).

3
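Interaction of the Lexical Analyzer with the
Parser
[Figure: the source program feeds the lexical analyzer. The parser
sends "get next token" requests to the lexical analyzer, which returns
(token, tokenval) pairs. Both components report errors and both access
the symbol table.]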
4
The Reason Why Lexical Analysis is a Separate
Phase
  • Simplifies the design of the compiler
    • Without a separate lexer, LL(1) or LR(1) parsing with
      one token of lookahead would not be possible
      (multiple characters/tokens to match)
  • Provides efficient implementation
    • Systematic techniques to implement lexical
      analyzers by hand or automatically from
      specifications
    • Stream buffering methods to scan input
  • Improves portability
    • Non-standard symbols and alternate character
      encodings can be normalized (e.g. trigraphs)

5
Attributes of Tokens
[Figure: the lexical analyzer turns the input
    y := 31 + 28*x
into the token stream
    <id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">
and passes each token, together with its attribute tokenval, to the
parser.]
6
Tokens, Patterns, and Lexemes
  • A token is a classification of lexical units
    • For example: id and num
  • Lexemes are the specific character strings that
    make up a token
    • For example: abc and 123
  • Patterns are rules describing the set of lexemes
    belonging to a token
    • For example: "letter followed by letters and
      digits" and "non-empty sequence of digits"
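
The relationship between the three terms can be sketched in C. The function below takes a complete lexeme and reports which of the two example patterns it matches. The names `classify`, `TOK_ID`, `TOK_NUM` and the extra "no match" value `TOK_NONE` are illustrative assumptions, not part of the slides.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Token classes from the slide; TOK_NONE is an added "no match" value. */
enum token { TOK_NONE, TOK_ID, TOK_NUM };

/* Classify a complete lexeme against the two example patterns:
   id  = letter followed by letters and digits
   num = non-empty sequence of digits */
enum token classify(const char *lexeme)
{
    assert(lexeme != NULL);
    const unsigned char *p = (const unsigned char *)lexeme;

    if (isalpha(*p)) {
        for (p++; isalnum(*p); p++)
            ;                                   /* skip letters and digits */
        return *p == '\0' ? TOK_ID : TOK_NONE;
    }
    if (isdigit(*p)) {
        for (p++; isdigit(*p); p++)
            ;                                   /* skip digits */
        return *p == '\0' ? TOK_NUM : TOK_NONE;
    }
    return TOK_NONE;
}
```

With this sketch, the lexeme `abc` is classified as `TOK_ID` and `123` as `TOK_NUM`, while a string such as `1a` matches neither pattern.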

7
Tokens, Patterns, and Lexemes
  • A lexeme is a sequence of characters from the
    source program that is matched by a pattern for a
    token.

[Table with columns: Token, Lexeme, Pattern]
8
Tokens, Patterns, and Lexemes
9
3.2 Input Buffering
  • Examines ways to speed up reading the source
    program.
  • In the one-buffer technique, the lexeme currently
    being processed is overwritten when the buffer is
    reloaded.
  • A two-buffer scheme handles large lookahead safely.

10
3.2.1 Buffer Pairs
  • Two buffers of the same size, say 4096, are
    alternately reloaded.
  • Two pointers to the input are maintained:
    • Pointer lexeme_Begin marks the beginning of the
      current lexeme.
    • Pointer forward scans ahead until a pattern match
      is found.

11
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else forward := forward + 1;
12
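3.2.2 Sentinels
  • Each buffer half ends with a sentinel character eof, so the
    common case needs only one test per character advance.
[Figure: a buffer pair with an eof sentinel at the end of each half;
an eof inside a half marks the end of the input.]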
13
    forward := forward + 1;
    if forward↑ = eof then begin
        if forward at end of first half then begin
            reload second half;
            forward := forward + 1
        end
        else if forward at end of second half then begin
            reload first half;
            move forward to beginning of first half
        end
        else terminate lexical analysis
    end
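
The sentinel scheme can be sketched in C as follows. This is a minimal illustration under several assumptions that are not in the slides: the input comes from an in-memory string rather than a file, each half holds only 4 characters so that reloads are easy to observe, the NUL character stands in for eof (so the source text must not contain NUL bytes), and `forward` points at the next unread character instead of being incremented before the test. The names `scanner`, `scanner_init`, `reload` and `next_char` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 4              /* size of each buffer half (tiny, for demonstration) */
#define SENTINEL '\0'    /* stands in for eof; assumes no NUL bytes in the source */

struct scanner {
    const char *src;     /* unread remainder of the (hypothetical) input source */
    char buf[2 * N + 2]; /* first half, sentinel, second half, sentinel */
    char *forward;       /* next character to be read */
};

/* Fill one half with up to N characters. A full half is followed by its
   sentinel slot; a short read leaves a sentinel inside the half, which
   marks the real end of the input. */
static void reload(struct scanner *s, char *half)
{
    size_t n = strlen(s->src);
    if (n > N)
        n = N;
    memcpy(half, s->src, n);
    half[n] = SENTINEL;
    s->src += n;
}

void scanner_init(struct scanner *s, const char *text)
{
    assert(text != NULL);
    s->src = text;
    s->buf[N] = SENTINEL;          /* sentinel after the first half */
    s->buf[2 * N + 1] = SENTINEL;  /* sentinel after the second half */
    reload(s, s->buf);
    s->forward = s->buf;
}

/* Return the next input character, or -1 at end of input. */
int next_char(struct scanner *s)
{
    for (;;) {
        char c = *s->forward;
        if (c != SENTINEL) {                 /* common case: a single test */
            s->forward++;
            return (unsigned char)c;
        }
        if (s->forward == s->buf + N) {      /* at end of first half */
            reload(s, s->buf + N + 1);
            s->forward = s->buf + N + 1;
        } else if (s->forward == s->buf + 2 * N + 1) { /* at end of second half */
            reload(s, s->buf);
            s->forward = s->buf;
        } else {
            return -1;                       /* sentinel inside a half: end of input */
        }
    }
}
```

The sketch leaves out `lexeme_Begin`; a real scanner would also have to make sure a reload never overwrites a lexeme that is still being matched.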
14
Specification of Patterns for Tokens: Definitions
  • An alphabet Σ is a finite set of symbols
    (characters)
  • A string s is a finite sequence of symbols from Σ
    • |s| denotes the length of string s
    • ε denotes the empty string, thus |ε| = 0
  • A language is a specific set of strings over some
    fixed alphabet Σ

15
Specification of Patterns for Tokens: String
Operations
  • The concatenation of two strings x and y is
    denoted by xy
  • The exponentiation of a string s is defined by
      s^0 = ε (the empty string, a string of length zero)
      s^i = s^(i-1) s   for i > 0
    Note that sε = εs = s
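
String exponentiation translates directly into a loop. The C sketch below builds s^i by appending s to the result i times, starting from the empty string. The function name `str_power` and its ownership convention (the caller frees the result) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of s^i (s concatenated i times).
   s^0 is the empty string. The caller frees the result.
   Returns NULL if the allocation fails. */
char *str_power(const char *s, unsigned i)
{
    assert(s != NULL);
    size_t len = strlen(s);
    char *result = malloc(len * i + 1);
    if (result == NULL)
        return NULL;
    result[0] = '\0';                        /* s^0 = empty string */
    for (unsigned k = 0; k < i; k++)         /* s^(k+1) = s^k s */
        memcpy(result + k * len, s, len + 1);
    return result;
}
```

For example, `str_power("ab", 3)` yields the string `ababab`, and `str_power("ab", 0)` yields the empty string.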

16
Specification of Patterns for Tokens: Language
Operations
  • Union: L ∪ M = {s | s ∈ L or s ∈ M}
  • Concatenation: LM = {xy | x ∈ L and y ∈ M}
  • Exponentiation: L^0 = {ε}, L^i = L^(i-1) L
  • Kleene closure: L* = ∪ (i = 0, …, ∞) L^i
  • Positive closure: L+ = ∪ (i = 1, …, ∞) L^i

17
Language Operations: Examples
    L = {A, B, C, D}    D = {1, 2, 3}
    L ∪ D = {A, B, C, D, 1, 2, 3}
    LD = {A1, A2, A3, B1, B2, B3, C1, C2, C3, D1, D2, D3}
    L^2 = {AA, AB, AC, AD, BA, BB, BC, BD, CA, …, DD}
    L^4 = L^2 L^2 = ??
    L* = {all possible strings of L plus ε}
    L+ = L* − {ε}
    L (L ∪ D) = ??
    L (L ∪ D)* = ??
18
Specification of Patterns for Tokens: Regular
Expressions
  • Basis symbols:
    • ε is a regular expression denoting language {ε}
    • a ∈ Σ is a regular expression denoting {a}
  • If r and s are regular expressions denoting
    languages L(r) and M(s) respectively, then
    • r|s is a regular expression denoting L(r) ∪ M(s)
    • rs is a regular expression denoting L(r)M(s)
    • r* is a regular expression denoting L(r)*
    • (r) is a regular expression denoting L(r)
  • A language defined by a regular expression is
    called a regular set

19
  • Examples
    • Let Σ = {a, b}
    • a | b
    • (a | b) (a | b)
    • a*
    • (a | b)*
    • a | a*b
  • We assume that * has the highest precedence and
    is left associative. Concatenation has the second
    highest precedence and is left associative, and |
    has the lowest precedence and is left
    associative.
    • (a) | ((b)*(c)) = a | b*c

20
Algebraic Properties of Regular Expressions
21
Finite Automaton
  • Given an input string, we need a machine that
    has a regular expression hard-coded in it and can
    tell whether the input string matches the pattern
    described by the regular expression or not.
  • A machine that determines whether a given string
    belongs to a language is called a finite
    automaton.

22
Deterministic Finite Automaton
  • Definition: a deterministic finite automaton (DFA) is
    a five-tuple (Σ, S, δ, s0, F) where
    • Σ is the alphabet
    • S is the set of states
    • δ is the transition function (S × Σ → S)
    • s0 is the starting state
    • F is the set of final states (F ⊆ S)
  • Notation
    • Use a transition diagram to describe a DFA
    • States are nodes, transitions are directed,
      labeled edges, some states are marked as final,
      one state is marked as starting
  • If the automaton stops at a final state on end of
    input, then the input string belongs to the
    language.
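
The five-tuple maps naturally onto a transition table. The C sketch below simulates a DFA for the regular expression (a|b)(a|b)b over Σ = {a, b}, with states 1 to 4, start state 1 and final state 4. The dead state 0, which makes the transition function total, and the names `delta`, `is_final` and `dfa_accepts` are assumptions added for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* States 1..4 of the automaton plus an added dead state 0. */
enum { DEAD = 0, NSTATES = 5, NSYMS = 2 };

/* delta[state][symbol], where symbol 0 is 'a' and symbol 1 is 'b'. */
static const int delta[NSTATES][NSYMS] = {
    /* 0 (dead) */ { DEAD, DEAD },
    /* 1 */        { 2, 2 },
    /* 2 */        { 3, 3 },
    /* 3 */        { DEAD, 4 },
    /* 4 */        { DEAD, DEAD },
};

/* F = {4} */
static const bool is_final[NSTATES] = { false, false, false, false, true };

/* Run the DFA over the whole input and report whether it stops in a
   final state. */
bool dfa_accepts(const char *input)
{
    assert(input != NULL);
    int state = 1;                              /* s0 */
    for (; *input != '\0'; input++) {
        if (*input != 'a' && *input != 'b')
            return false;                       /* symbol outside the alphabet */
        state = delta[state][*input == 'a' ? 0 : 1];
    }
    return is_final[state];
}
```

Under these assumptions `dfa_accepts("aab")` is true, while `dfa_accepts("aaa")` and `dfa_accepts("aabb")` are false because the automaton ends in the dead state.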

23
r = a
  • Σ = {a}
  • L = {a}
  • S = {1, 2}
  • δ(1,a) = 2
  • s0 = 1
  • F = {2}

24
r = a|b
  • Σ = {a, b}
  • L = {a, b}
  • S = {1, 2}
  • δ(1,a) = 2, δ(1,b) = 2
  • s0 = 1
  • F = {2}

25
r = a(a|b)
  • Σ = {a, b}
  • L = {aa, ab}
  • S = {1, 2, 3}
  • δ(1,a) = 2, δ(2,a) = 3, δ(2,b) = 3
  • s0 = 1
  • F = {3}

26
r = a*
  • Σ = {a}
  • L = {ε, a, aa, aaa, aaaa, …}
  • S = {1}
  • δ(1,a) = 1 (ε is accepted because the start state 1
    is also final)
  • s0 = 1
  • F = {1}

27
r = a+
  • Σ = {a}
  • L = {a, aa, aaa, aaaa, …}
  • S = {1, 2}
  • δ(1,a) = 2, δ(2,a) = 2
  • s0 = 1
  • F = {2}
  • Note: a+ = aa*

28
r = (a|b)(a|b)b
  • Σ = {a, b}
  • L = {aab, abb, bab, bbb}
  • S = {1, 2, 3, 4}
  • δ(1,a) = 2, δ(1,b) = 2, δ(2,a) = 3, δ(2,b) = 3,
    δ(3,b) = 4
  • s0 = 1
  • F = {4}

29
r = (a|b)*
  • Σ = {a, b}
  • L = {ε, a, b, aa, bb, ba, ab, aaa, …, bbb, …, abab, …,
    baba, bbba, …}
  • S = {1}
  • δ(1,a) = 1, δ(1,b) = 1
  • s0 = 1
  • F = {1}

30
r = (a|b)+
  • Σ = {a, b}
  • L = {a, aa, aaa, …, b, bb, bbb, …}
  • S = {1, 2}
  • δ(1,a) = 2, δ(1,b) = 2, δ(2,a) = 2, δ(2,b) = 2
  • s0 = 1
  • F = {2}
  • Note: (a|b)+ = (a|b)(a|b)*

31
r = a+|b+
  • Σ = {a, b}
  • L = {a, aa, aaa, …, b, bb, bbb, …}
  • S = {1, 2, 3}
  • δ(1,a) = 2, δ(2,a) = 2, δ(1,b) = 3, δ(3,b) = 3
  • s0 = 1
  • F = {2, 3}

32
r = a(a|b)*
  • Σ = {a, b}
  • L = {a, aa, ab, …, aba, …, abb, …, abbb, …}
  • S = {1, 2}
  • δ(1,a) = 2, δ(2,a) = 2, δ(2,b) = 2
  • s0 = 1
  • F = {2}

33
r = a(b|a)b+
  • Σ = {a, b}
  • L = {aab, abb, aabb, …, abbb, abbbb, …}
  • S = {1, 2, 3, 4}
  • δ(1,a) = 2, δ(2,a) = 3, δ(2,b) = 3, δ(3,b) = 4,
    δ(4,b) = 4
  • s0 = 1
  • F = {4}

34
r = ab*a(a+|b+)
  • Σ = {a, b}
  • L = {aaa, aab, abaa, abbaa, …, abbab, abbabbb, …}
  • S = {1, 2, 3, 4, 5}
  • δ(1,a) = 2, δ(2,b) = 2, δ(2,a) = 3, δ(3,a) = 4, δ(4,a) = 4,
    δ(3,b) = 5, δ(5,b) = 5
  • s0 = 1
  • F = {4, 5}

35
Specification of Patterns for Tokens: Regular
Definitions
  • Regular definitions introduce a naming
    convention:
      d1 → r1
      d2 → r2
      …
      dn → rn
    where each ri is a regular expression over
    Σ ∪ {d1, d2, …, di-1}
  • Any dj in ri can be textually substituted in ri
    to obtain an equivalent set of definitions

36
Specification of Patterns for Tokens: Regular
Definitions
  • Example:
      letter → A|B|…|Z|a|b|…|z
      digit → 0|1|…|9
      id → letter ( letter | digit )*
  • Regular definitions are not recursive:
      digits → digit digits | digit    wrong!

37
Specification of Patterns for Tokens: Notational
Shorthand
  • The following shorthands are often used:
      r+ = rr*
      r? = r|ε
      [a-z] = a|b|c|…|z
  • Examples:
      digit → [0-9]
      num → digit+ (. digit+)? ( E (+|-)? digit+ )?
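
A regular definition like num can be matched by hand with one small step per sub-pattern. The C sketch below returns the length of the longest prefix of a string that matches num → digit+ (. digit+)? ( E (+|-)? digit+ )?. The function names `digits` and `match_num`, and the convention of returning 0 when there is no match, are assumptions made for this illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Skip digit* starting at p and return a pointer past the digits. */
static const char *digits(const char *p)
{
    while (isdigit((unsigned char)*p))
        p++;
    return p;
}

/* Length of the longest prefix of s that matches
       num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
   or 0 if no prefix matches. */
size_t match_num(const char *s)
{
    assert(s != NULL);
    const char *p = digits(s);
    if (p == s)
        return 0;                    /* digit+ needs at least one digit */

    if (*p == '.') {                 /* optional fraction */
        const char *q = digits(p + 1);
        if (q > p + 1)
            p = q;                   /* commit only if a digit follows '.' */
    }
    if (*p == 'E') {                 /* optional exponent */
        const char *q = p + 1;
        if (*q == '+' || *q == '-')
            q++;
        const char *r = digits(q);
        if (r > q)
            p = r;                   /* commit only if the exponent has digits */
    }
    return (size_t)(p - s);
}
```

The optional parts are committed only when they match completely, so `match_num("12.x")` stops after `12` and `match_num("1E")` stops after `1`, which mirrors the retract step a transition diagram would perform.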

38
Regular Definitions and Grammars
Grammar
    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term
         | term
    term → id
         | num
Regular definitions
    if → if
    then → then
    else → else
    relop → < | <= | <> | > | >= | =
    id → letter ( letter | digit )*
    num → digit+ (. digit+)? ( E (+|-)? digit+ )?
39
Constructing Transition Diagrams for Tokens
  • Transition diagrams (TDs) are used to represent
    the tokens; these are automata.
  • As characters are read, the relevant TDs are
    used to attempt to match a lexeme to a pattern.
  • Each TD has:
    • States: represented by circles
    • Actions: represented by arrows between states
    • Start state: beginning of a pattern
      (arrowhead)
    • Final state(s): end of a pattern (concentric
      circles)
  • Each TD is deterministic: there is no need to
    choose between two different actions.

40
Example: All RELOPs
41
Example TDs: id and delim
[Figure: transition diagrams for "keyword or id" and for delim]
42
Combine TD for KW and IDs
  • install_id() decides the attribute:
    • It looks up the accepted lexeme in the list of
      keywords; if it matches, zero is returned.
    • Otherwise it looks up the lexeme in the symbol
      table; if it is found, its address is returned.
    • If the lexeme is not found in the symbol table,
      install_id() first installs the ID in the symbol
      table and returns the address of the newly created
      entry.
  • gettoken() decides the token:
    • If install_id() returned zero, the keyword itself
      (or its numeric form) is returned as the token.
    • Otherwise the token ID is returned.
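
The division of work between the two routines can be sketched in C. This is only an illustration under stated assumptions: the slides' routines take no arguments, whereas here the lexeme is passed in explicitly; the "address" of a symbol-table entry is represented by a 1-based index; the table is a small fixed-size array; and the token codes `TOK_IF`, `TOK_THEN`, `TOK_ELSE`, `TOK_ID` and the -1 error value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token { TOK_IF = 256, TOK_THEN, TOK_ELSE, TOK_ID };

/* Keyword table: string versions of the keywords and their token values. */
static const struct { const char *word; enum token tok; } keywords[] = {
    { "if", TOK_IF }, { "then", TOK_THEN }, { "else", TOK_ELSE },
};
#define NKEYWORDS (sizeof keywords / sizeof keywords[0])

#define MAXSYMS 64
#define MAXLEN  32
static char symtab[MAXSYMS][MAXLEN];   /* fixed-size symbol table (assumption) */
static int nsyms = 0;

/* Returns 0 for a keyword. Otherwise returns the 1-based symbol-table
   index (standing in for the entry's address), installing the lexeme
   if it is new. Returns -1 if the table is full or the lexeme too long. */
int install_id(const char *lexeme)
{
    assert(lexeme != NULL);
    for (size_t i = 0; i < NKEYWORDS; i++)
        if (strcmp(lexeme, keywords[i].word) == 0)
            return 0;
    for (int i = 0; i < nsyms; i++)
        if (strcmp(lexeme, symtab[i]) == 0)
            return i + 1;
    if (nsyms == MAXSYMS || strlen(lexeme) >= MAXLEN)
        return -1;
    strcpy(symtab[nsyms], lexeme);
    return ++nsyms;
}

/* Decide the token from the value install_id() returned for the lexeme. */
enum token gettoken(const char *lexeme, int attr)
{
    assert(lexeme != NULL);
    if (attr == 0)
        for (size_t i = 0; i < NKEYWORDS; i++)
            if (strcmp(lexeme, keywords[i].word) == 0)
                return keywords[i].tok;
    return TOK_ID;
}
```

Calling `install_id("count")` twice returns the same index both times, because the second call finds the entry installed by the first.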

43
Example TDs: Unsigned Numbers
Questions: Is ordering important for unsigned
numbers? Why are there no TDs
for then, else, if?
44
Keywords Recognition
All keywords / reserved words are matched as ids
  • After the match, the symbol table or a special
    keyword table is consulted.
  • The keyword table contains string versions of all
    keywords and their associated token values.
  • If no match is found, it is assumed
    that an id has been discovered.

45
Transition Diagrams and Lexical Analyzers
    state = 0;
    token nexttoken()
    {
        while (1) {
            switch (state) {
            case 0:
                c = nextchar();   /* c is lookahead character */
                if (c == blank || c == tab || c == newline) {
                    state = 0;
                    lexeme_beginning++;   /* advance beginning of lexeme */
                }
                else if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else state = fail();
                break;
            /* cases 1-8 here */
46
            case 9:
                c = nextchar();
                if (isletter(c)) state = 10;
                else state = fail();
                break;
            case 10:
                c = nextchar();
                if (isletter(c)) state = 10;
                else if (isdigit(c)) state = 10;
                else state = 11;
                break;
            case 11:
                retract(1);
                install_id();
                return ( gettoken() );
            /* cases 12-24 here */
            case 25:
                c = nextchar();
                if (isdigit(c)) state = 26;
                else state = fail();
                break;
            case 26:
                c = nextchar();
                if (isdigit(c)) state = 26;
                else state = 27;
                break;
            case 27:
                retract(1);
                install_num();
                return ( NUM );
            }
        }
    }
Case numbers correspond to transition diagram
states!
47
When Failures Occur
    int state = 0, start = 0;
    int lexical_value;   /* to "return" second component of token */

    int fail()
    {
        forward = token_beginning;
        switch (start) {
        case 0:  start = 9;  break;
        case 9:  start = 12; break;
        case 12: start = 20; break;
        case 20: start = 25; break;
        case 25: recover();  break;
        default: /* compiler error */
            break;
        }
        return start;
    }
48
Using a Lex Generator
  • Lex source program lex.l → Lex compiler → lex.yy.c
  • lex.yy.c → C compiler → a.out
  • Input stream (input.c) → a.out → sequence of tokens