CSCE 330 Programming Language Structures Chapter 3: Lexical and Syntactic Analysis - PowerPoint PPT Presentation

Slides: 63
Provided by: MarcoVa
Learn more at: https://www.cse.sc.edu
Transcript and Presenter's Notes

1
CSCE 330 Programming Language Structures
Chapter 3: Lexical and Syntactic Analysis
  • Fall 2009
  • Marco Valtorta
  • mgv_at_cse.sc.edu
  • Syntactic sugar causes cancer of the semicolon.
  • A. Perlis

2
Contents
  • 3.1 Chomsky Hierarchy
  • 3.2 Lexical Analysis
  • 3.3 Syntactic Analysis

3
3.1 Chomsky Hierarchy
  • Regular grammar -- least powerful
  • Context-free grammar (BNF)
  • Context-sensitive grammar
  • Unrestricted grammar

4
Regular Grammar
  • Simplest; least powerful
  • Equivalent to:
  • Regular expression
  • Finite-state automaton
  • Right regular grammar: ω ∈ T*, B ∈ N
  • A → ω B
  • A → ω

5
Example
  • Integer → 0 Integer | 1 Integer | ... | 9 Integer |
    0 | 1 | ... | 9
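
The grammar above can be read directly as a recognizer: each production consumes one digit and then either continues with Integer or stops. A minimal sketch in Java, where the class and method names are illustrative assumptions and not part of the course code:

```java
// Illustrative recognizer for the right regular grammar
//   Integer -> d Integer | d      (d is one of the digits 0..9)
// The names IntegerGrammar, integer and isInteger are assumptions for this sketch.
class IntegerGrammar {
    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Returns true if the suffix of s starting at pos can be derived from Integer.
    static boolean integer(String s, int pos) {
        if (pos >= s.length() || !isDigit(s.charAt(pos))) {
            return false;               // every production starts with a digit
        }
        if (pos == s.length() - 1) {
            return true;                // Integer -> d
        }
        return integer(s, pos + 1);     // Integer -> d Integer
    }

    static boolean isInteger(String s) {
        return integer(s, 0);
    }

    public static void main(String[] args) {
        System.out.println(isInteger("2009")); // true
        System.out.println(isInteger("20a9")); // false
        System.out.println(isInteger(""));     // false
    }
}
```

Because the recursive call is the last action of the method, the recursion could equally be written as a loop, which is the finite-state view of the same grammar.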

6
Regular Grammars
  • Left regular grammar equivalent
  • Used in construction of tokenizers (scanners,
    lexers)
  • Less powerful than context-free grammars
  • Not a regular language:
  • { a^n b^n | n ≥ 1 }
  • i.e., cannot balance ( ), { }, begin end

7
Context-free Grammars
  • BNF: a stylized form of CFG
  • Equivalent to a pushdown automaton
  • For a wide class of unambiguous CFGs, there are
    table-driven, linear time parsers

8
Context-Sensitive Grammars
  • Production:
  • α → β, with |α| ≤ |β|
  • α, β ∈ (N ∪ T)*
  • i.e., left-hand side can be composed of strings
    of terminals and nonterminals

9
Undecidable Properties of CSGs
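  • Given a string ω and grammar G: is ω ∈ L(G)?
  • Is L(G) non-empty?
  • Defn: Undecidable means that you cannot write a
    computer program that is guaranteed to halt and
    decide the question for every string ω and grammar G.
  • Note: for context-sensitive grammars, whose productions
    never shrink the string, the membership question is
    decidable (though expensive); it is undecidable for
    unrestricted grammars.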

10
Unrestricted Grammar
  • Equivalent to
  • Turing machine
  • von Neumann machine
  • C, Java
  • That is, can compute any computable function.

11
Contents
  • 3.1 Chomsky Hierarchy
  • 3.2 Lexical Analysis
  • 3.3 Syntactic Analysis

12
Lexical Analysis
  • Purpose: transform program representation
  • Input: printable ASCII characters
  • Output: tokens
  • Discard: whitespace, comments
  • Defn: A token is a logically cohesive sequence of
    characters representing a single symbol.

13
Example Tokens
  • Identifiers
  • Literals: 123, 5.67, 'x', true
  • Keywords: bool char ...
  • Operators: + - * / ...
  • Punctuation: ; , ( ) { }

14
Other Sequences
  • Whitespace: space, tab
  • Comments:
  • // any-char* end-of-line
  • End-of-line
  • End-of-file

15
Why a Separate Phase?
  • Simpler, faster machine model than parser
  • 75% of time spent in lexer for non-optimizing
    compiler
  • Differences in character sets
  • End of line convention differs

16
Regular Expressions
  • RegExpr     Meaning
  • x           a character x
  • \x          an escaped character, e.g., \n
  • { name }    a reference to a name
  • M | N       M or N
  • M N         M followed by N
  • M*          zero or more occurrences of M

17
  • RegExpr     Meaning
  • M+          one or more occurrences of M
  • M?          zero or one occurrence of M
  • [aeiou]     the set of vowels
  • [0-9]       the set of digits
  • .           any single character
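
These operators can be tried out with Java's `java.util.regex` package, which uses closely related notation. This is only an illustration: `java.util.regex` is not a lexer generator, it has no named references in the lex style, and the class name `RegexOperators` and helper `m` are assumptions made for this sketch.

```java
import java.util.regex.Pattern;

// Small demonstration of the regular-expression operators using java.util.regex.
class RegexOperators {
    // True if the whole input string matches the regular expression.
    static boolean m(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(m("a|b", "b"));        // M | N  : true
        System.out.println(m("ab", "ab"));        // M N    : true
        System.out.println(m("a*", ""));          // M*     : true, zero occurrences allowed
        System.out.println(m("a+", ""));          // M+     : false, at least one needed
        System.out.println(m("ab?", "a"));        // M?     : true, the b is optional
        System.out.println(m("[aeiou]", "e"));    // set of vowels : true
        System.out.println(m("[0-9]+", "330"));   // set of digits, repeated : true
        System.out.println(m(".", "x"));          // any single character : true
        System.out.println(m("\\n", "\n"));       // escaped character : true
    }
}
```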

18
Clite Lexical Syntax
  • Category    Definition
  • anyChar     [ -~]
  • Letter      [a-zA-Z]
  • Digit       [0-9]
  • Whitespace  [ \t]
  • Eol         \n
  • Eof         \004

19
  • Category    Definition
  • Keyword     bool | char | else | false | float |
    if | int | main | true | while
  • Identifier  {Letter}({Letter} | {Digit})*
  • integerLit  {Digit}+
  • floatLit    {Digit}+\.{Digit}+
  • charLit     '{anyChar}'

20
  • Category    Definition
  • Operator    = | || | && | == | != | < | <= | > |
    >= | + | - | * | / | ! | [ | ]
  • Separator   ; | . | { | } | ( | )
  • Comment     // ({anyChar} | {Whitespace})* {eol}
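
As a rough check of these definitions, the categories can be written as `java.util.regex` patterns built from shared fragments, much as a lexer generator expands named references. The sketch below assumes the usual reading of the categories (Letter as a-z and A-Z, Digit as 0-9, anyChar as the printable ASCII range); the class name and the `classify` helper are hypothetical, and keywords are deliberately not distinguished from identifiers here.

```java
import java.util.regex.Pattern;

// Clite lexical categories expressed as java.util.regex patterns (a sketch).
class CliteLexicalSyntax {
    static final String LETTER = "[a-zA-Z]";
    static final String DIGIT = "[0-9]";

    static final Pattern IDENTIFIER =
        Pattern.compile(LETTER + "(" + LETTER + "|" + DIGIT + ")*");
    static final Pattern INTEGER_LIT = Pattern.compile(DIGIT + "+");
    static final Pattern FLOAT_LIT = Pattern.compile(DIGIT + "+\\." + DIGIT + "+");
    // Comment text up to, but not including, the end of line.
    static final Pattern COMMENT = Pattern.compile("//[ -~\\t]*");

    // Names the category whose pattern matches the entire lexeme.
    static String classify(String lexeme) {
        if (IDENTIFIER.matcher(lexeme).matches()) return "Identifier";
        if (INTEGER_LIT.matcher(lexeme).matches()) return "integerLit";
        if (FLOAT_LIT.matcher(lexeme).matches()) return "floatLit";
        if (COMMENT.matcher(lexeme).matches()) return "Comment";
        return "unknown";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"x1", "42", "5.67", "5.", "// main"}) {
            System.out.println(s + " -> " + classify(s));
        }
    }
}
```

Note that `5.` is rejected: the floatLit definition requires at least one digit on each side of the decimal point.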

21
Generators
  • Input: usually regular expression
  • Output: table (slow), code
  • C/C++: Lex, Flex
  • Java: JLex

22
Finite State Automata
  • Set of states: represented as graph nodes
  • Input alphabet + unique end symbol
  • State transition function:
  • Labelled (using alphabet) arcs in graph
  • Unique start state
  • One or more final states

23
Deterministic FSA
  • Defn: A finite state automaton is deterministic
    if for each state and each input symbol, there is
    at most one outgoing arc from the state labeled
    with the input symbol.

24
  • A Finite State Automaton for Identifiers

25
Definitions
  • A configuration on an FSA consists of a state and
    the remaining input.
  • A move consists of traversing the arc exiting the
    state that corresponds to the leftmost input
    symbol, thereby consuming it. If there is no such
    arc, then:
  • If no input remains and the state is final, then accept.
  • Otherwise, error.

26
  • An input is accepted if, starting with the start
    state, the automaton consumes all the input and
    halts in a final state.

27
Example
  • (S, a2i$) ⊢ (I, 2i$)
  • ⊢ (I, i$)
  • ⊢ (I, $)
  • ⊢ (F, )
  • Thus (S, a2i$) ⊢* (F, )
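
The sequence of moves can be reproduced with a small simulation. The sketch below assumes a three-state automaton for identifiers (start state S, in-identifier state I, final state F) and `$` as the unique end symbol; the diagram itself is not in this transcript, so the transition function and all names here are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates a deterministic FSA for identifiers and records each configuration.
class IdentifierDfa {
    enum State { S, I, F, ERROR }

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // State transition function: at most one arc per (state, symbol) pair.
    static State move(State state, char c) {
        switch (state) {
            case S:
                return isLetter(c) ? State.I : State.ERROR;
            case I:
                if (isLetter(c) || isDigit(c)) return State.I;
                return c == '$' ? State.F : State.ERROR;
            default:
                return State.ERROR;      // no arcs leave F or ERROR
        }
    }

    // Lists the configurations (state, remaining input) visited on the given input.
    static List<String> trace(String input) {
        List<String> configs = new ArrayList<>();
        State state = State.S;
        int pos = 0;
        configs.add("(" + state + ", " + input.substring(pos) + ")");
        while (pos < input.length() && state != State.ERROR) {
            state = move(state, input.charAt(pos));
            pos++;                        // the leftmost symbol is consumed
            configs.add("(" + state + ", " + input.substring(pos) + ")");
        }
        return configs;
    }

    // Accept if all input is consumed and the automaton halts in the final state.
    static boolean accepts(String input) {
        State state = State.S;
        for (char c : input.toCharArray()) {
            state = move(state, c);
        }
        return state == State.F;
    }

    public static void main(String[] args) {
        System.out.println(trace("a2i$"));
        System.out.println(accepts("a2i$"));
        System.out.println(accepts("2ai$"));
    }
}
```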

28
Some Conventions
  • Explicit terminator used only for program as a
    whole, not each token.
  • An unlabeled arc represents any other valid input
    symbol.
  • Recognition of a token ends in a final state.
  • Recognition of a non-token transitions back to
    start state.

29
  • Recognition of end symbol (end of file) ends in
    a final state.
  • Automaton must be deterministic.
  • Drop keywords; handle separately.
  • Must consider all sequences with a common prefix
    together.

30

31

32
Lexer Code
  • Parser calls lexer whenever it needs a new token.
  • Lexer must remember where it left off.
  • Greedy consumption goes 1 character too far:
  • peek function
  • pushback function
  • no symbol consumed by start state
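
The "one character too far" problem and the pushback remedy can be illustrated with `java.io.PushbackReader` from the standard library. This is a sketch of the idea rather than the approach taken in the course's lexer (which keeps the extra character in a variable instead); the class name and the helper methods are assumptions.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Shows greedy consumption followed by pushback of the character read too far.
class PushbackDemo {
    // Reads a run of digits, then pushes back the first non-digit character
    // so that the next token can start with it.
    static String readDigits(PushbackReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c >= '0' && c <= '9') {
            sb.append((char) c);
        }
        if (c != -1) {
            in.unread(c);               // went one character too far: put it back
        }
        return sb.toString();
    }

    // Splits text of the form <digits><one char><digits> into three pieces.
    static String demo(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            String first = readDigits(in);
            char op = (char) in.read(); // the pushed-back character is read again
            String second = readDigits(in);
            return first + " " + op + " " + second;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("123+4")); // 123 + 4
    }
}
```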

33
From Design to Code
    private char ch = ' ';
    public Token next( ) {
        do {
            switch (ch) {
                ...
            }
        } while (true);
    }

34
Remarks
  • Loop only exited when a token is found
  • Loop exited via a return statement.
  • Variable ch must be global. Initialized to a
    space character.
  • Exact nature of a Token irrelevant to design.

35
Translation Rules
  • Traversing an arc from A to B:
  • If labeled with x: test ch == x
  • If unlabeled: else/default part of if/switch. If
    only arc, no test need be performed.
  • Get next character if A is not start state

36
  • A node with an arc to itself is a do-while.
  • Condition corresponds to whichever arc is
    labeled.

37
  • Otherwise the move is translated to an if/switch
  • Each arc is a separate case.
  • Unlabeled arc is default case.
  • A sequence of transitions becomes a sequence of
    translated statements.

38
  • A complex diagram is translated by boxing its
    components so that each box is one node.
  • Translate each box using an outside-in strategy.

39
    private boolean isLetter(char c) {
        // Test the parameter c, not the lexer's current character ch.
        return c >= 'a' && c <= 'z'
            || c >= 'A' && c <= 'Z';
    }

40
    private String concat(String set) {
        StringBuffer r = new StringBuffer();
        do {
            r.append(ch);
            ch = nextChar( );
        } while (set.indexOf(ch) >= 0);
        return r.toString( );
    }

41
    public Token next( ) {
        do {
            if (isLetter(ch)) { // ident or keyword
                String spelling = concat(letters + digits);
                return Token.keyword(spelling);
            } else if (isDigit(ch)) { // int or float literal
                String number = concat(digits);
                if (ch != '.')
                    return Token.mkIntLiteral(number);
                number += concat(digits);
                return Token.mkFloatLiteral(number);
            }

42
            else switch (ch) {
            case ' ': case '\t': case '\r': case eolnCh:
                ch = nextChar( ); break;
            case eofCh: return Token.eofTok;
            case '+': ch = nextChar( );
                return Token.plusTok;
            case '&': check('&'); return Token.andTok;
            case '=': return chkOpt('=', Token.assignTok,
                                    Token.eqeqTok);

43
Source and Tokens
Source:
    // a first program
    // with 2 comments
    int main ( ) {
        char c;
        int i;
        c = 'h';
        i = c + 3;
    } // main
Tokens:
  • int
  • main
  • (
  • )
  • {
  • char
  • Identifier c
  • ;
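
To see where such a token list comes from, here is a compact lexer sketch that follows the design of the preceding slides (a current character `ch` initialized to a space, a `concat` helper, and a loop that is left only by `return`). It is a simplification, not the course's Lexer class: tokens are plain strings, only integer literals are handled, every operator is a single character, and the names `MiniLexer` and `tokenize` are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified hand-written lexer; tokens are represented as strings.
class MiniLexer {
    private static final char EOF_CH = '\004';
    private static final List<String> KEYWORDS = Arrays.asList(
        "bool", "char", "else", "false", "float", "if", "int", "main", "true", "while");
    private static final String LETTERS =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final String DIGITS = "0123456789";

    private final String source;
    private int pos = 0;
    private char ch = ' ';   // current character, initialized to a space

    MiniLexer(String source) {
        this.source = source;
    }

    private char nextChar() {
        return pos < source.length() ? source.charAt(pos++) : EOF_CH;
    }

    // Collects ch plus all following characters that belong to the given set.
    private String concat(String set) {
        StringBuilder r = new StringBuilder();
        do {
            r.append(ch);
            ch = nextChar();
        } while (set.indexOf(ch) >= 0);
        return r.toString();
    }

    String next() {
        do {
            if (LETTERS.indexOf(ch) >= 0) {            // identifier or keyword
                String spelling = concat(LETTERS + DIGITS);
                return KEYWORDS.contains(spelling) ? spelling : "Identifier " + spelling;
            } else if (DIGITS.indexOf(ch) >= 0) {      // integer literal only
                return "IntLiteral " + concat(DIGITS);
            }
            switch (ch) {
                case ' ': case '\t': case '\r': case '\n':
                    ch = nextChar();                   // skip whitespace, then loop
                    break;
                case EOF_CH:
                    return "EOF";
                case '/':
                    ch = nextChar();
                    if (ch != '/') return "/";
                    do {                               // skip comment to end of line
                        ch = nextChar();
                    } while (ch != '\n' && ch != EOF_CH);
                    break;
                case '\'': {
                    char literal = nextChar();
                    nextChar();                        // closing quote, unchecked here
                    ch = nextChar();
                    return "CharLiteral " + literal;
                }
                default: {
                    String token = String.valueOf(ch); // single-character token
                    ch = nextChar();
                    return token;
                }
            }
        } while (true);
    }

    static List<String> tokenize(String source) {
        MiniLexer lexer = new MiniLexer(source);
        List<String> tokens = new ArrayList<>();
        for (String t = lexer.next(); !t.equals("EOF"); t = lexer.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String program = "// a first program\nint main ( ) {\n  char c;\n  c = 'h';\n} // main\n";
        for (String t : tokenize(program)) {
            System.out.println(t);
        }
    }
}
```

Keywords are recognized by looking up the spelling after the identifier automaton has run, which is the "drop keywords; handle separately" convention.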

44
JLex A Lexical Analyzer Generator for Java
We will look at an example JLex specification
(adapted from the manual). Consult the manual
for details on how to write your own JLex
specifications.
Definition of tokens (regular expressions) -> JLex
-> Java file: scanner class (recognizes tokens)
45
The JLex tool
Layout of JLex file:
    user code (added to start of generated file)
    %%
    options
    %{
    user code (added inside the scanner class declaration)
    %}
    macro definitions
    %%
    lexical declaration
  • User code is copied directly into the output class.
  • JLex directives allow you to include code in the
    lexical analysis class, change names of various
    components, switch on character counting, line
    counting, manage EOF, etc.
  • Macro definitions give names to useful regexps.
  • Regular expression rules define the tokens to be
    recognised and the actions to be taken.
46
Java.io.StreamTokenizer
  • An alternative to JLex is to use the class
    StreamTokenizer from java.io
  • The class recognizes 4 types of lexical elements
    (tokens)
  • number (sequence of decimal digits, optionally
    starting with the - (minus) sign and/or containing
    a decimal point)
  • word (sequence of letters and digits starting
    with a letter)
  • line separator
  • end of file
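
A short sketch of how the four token types show up in practice. It uses only the documented `java.io.StreamTokenizer` members (`nextToken`, `ttype`, `nval`, `sval`, `eolIsSignificant` and the `TT_*` constants); the class name, the helper method and the sample input are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Lists the tokens that StreamTokenizer reports for a piece of text.
class StreamTokenizerDemo {
    static List<String> tokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true);           // report line separators as tokens
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_NUMBER:
                        out.add("number " + st.nval);   // nval is a double
                        break;
                    case StreamTokenizer.TT_WORD:
                        out.add("word " + st.sval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        out.add("eol");
                        break;
                    default:
                        out.add("char " + (char) st.ttype); // any other single character
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.add("eof");
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("count 42 -3.5\nend"));
    }
}
```

Numbers are always delivered as `double` values, so integer and floating-point literals are not distinguished; that is one reason a hand-written or generated lexer is usually preferred for a programming language.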

47
Parsing
  • Some terminology
  • Different types of parsing strategies
  • bottom up
  • top down
  • Recursive descent parsing
  • What is it
  • How to implement one given an EBNF specification
  • (How to generate one using tools later)
  • (Bottom up parsing algorithms)

48
Parsing Some Terminology
  • Recognition
  • To answer the question does the input conform to
    the syntax of the language?
  • Parsing
  • Recognition + determination of phrase structure
    (for example by generating AST data structures)
  • (Un)ambiguous grammar
  • A grammar is unambiguous if there is at most
    one way to parse any input (i.e., for every
    syntactically correct program there is precisely
    one parse tree)

49
Different kinds of Parsing Algorithms
  • Two big groups of algorithms can be
    distinguished
  • bottom up strategies
  • top down strategies
  • Example: parsing of Micro-English

    Sentence ::= Subject Verb Object .
    Subject  ::= I | a Noun | the Noun
    Object   ::= me | a Noun | the Noun
    Noun     ::= cat | mat | rat
    Verb     ::= like | is | see | sees

    The cat sees the rat. The rat sees me. I like a cat.
    The rat like me. I see the rat. I sees a rat.
50
Top-down parsing
The parse tree is constructed starting at the top
(root).
[Figure: parse tree for "The cat sees a rat ." growing downward from the root Sentence]
51
Bottom up parsing
The parse tree grows from the bottom (leaves)
up to the top (root).
[Figure: parse tree for "The cat sees a rat ." growing upward from the words of the sentence]
52
Top-Down vs. Bottom-Up parsing
LL analysis (top-down): Left-to-right scan, Leftmost derivation
  • Scans the string left to right
  • Builds a leftmost derivation
  • Works by derivation, guided by look-ahead
LR analysis (bottom-up): Left-to-right scan, Rightmost derivation
  • Scans the string left to right
  • Builds a rightmost derivation (in reverse)
  • Works by reduction, guided by look-ahead
53
Recursive Descent Parsing
  • Recursive descent parsing is a straightforward
    top-down parsing algorithm.
  • We will now look at how to develop a recursive
    descent parser from an EBNF specification.
  • Idea: the parse tree structure corresponds to the
    call graph structure of parsing procedures that
    call each other recursively.

54
Recursive Descent Parsing
    Sentence ::= Subject Verb Object .
    Subject  ::= I | a Noun | the Noun
    Object   ::= me | a Noun | the Noun
    Noun     ::= cat | mat | rat
    Verb     ::= like | is | see | sees
Define a procedure parseN for each non-terminal N:
    private void parseSentence();
    private void parseSubject();
    private void parseObject();
    private void parseNoun();
    private void parseVerb();
55
Recursive Descent Parsing
    public class MicroEnglishParser {
        private TerminalSymbol currentTerminal;
        // Auxiliary methods will go here
        ...
        // Parsing methods will go here
        ...
    }
56
Recursive Descent Parsing Auxiliary Methods
    public class MicroEnglishParser {
        private TerminalSymbol currentTerminal;
        private void accept(TerminalSymbol expected) {
            if (currentTerminal matches expected)
                currentTerminal = next input terminal;
            else
                report a syntax error
        }
        ...
    }
57
Recursive Descent Parsing Parsing Methods
    Sentence ::= Subject Verb Object .

    private void parseSentence() {
        parseSubject();
        parseVerb();
        parseObject();
        accept('.');
    }
58
Recursive Descent Parsing Parsing Methods
    Subject ::= I | a Noun | the Noun

    private void parseSubject() {
        if (currentTerminal matches 'I')
            accept('I');
        else if (currentTerminal matches 'a') {
            accept('a'); parseNoun();
        } else if (currentTerminal matches 'the') {
            accept('the'); parseNoun();
        } else
            report a syntax error
    }
59
Recursive Descent Parsing Parsing Methods
    Noun ::= cat | mat | rat

    private void parseNoun() {
        if (currentTerminal matches 'cat')
            accept('cat');
        else if (currentTerminal matches 'mat')
            accept('mat');
        else if (currentTerminal matches 'rat')
            accept('rat');
        else
            report a syntax error
    }
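
Putting the pieces together, the following is a runnable sketch of a complete recursive descent recognizer for Micro-English. It departs from the slides in a few labeled ways: terminals are plain strings rather than a `TerminalSymbol` type, the input is assumed to be whitespace-separated with the final `.` as its own terminal, syntax errors are reported by throwing an exception, and the `recognizes` helper is an addition for demonstration.

```java
import java.util.Arrays;
import java.util.List;

// Recursive descent recognizer for Micro-English: one parse method per nonterminal.
class MicroEnglishParser {
    private static final String END = "<end>";   // marks exhausted input (assumption)
    private final List<String> terminals;
    private int position = 0;
    private String currentTerminal;

    MicroEnglishParser(String input) {
        terminals = Arrays.asList(input.trim().split("\\s+"));
        currentTerminal = terminals.get(0);
    }

    // Auxiliary method: consume the expected terminal or report a syntax error.
    private void accept(String expected) {
        if (!currentTerminal.equals(expected)) {
            throw new IllegalStateException(
                "syntax error: expected " + expected + " but found " + currentTerminal);
        }
        position++;
        currentTerminal = position < terminals.size() ? terminals.get(position) : END;
    }

    // Sentence ::= Subject Verb Object .
    void parseSentence() {
        parseSubject();
        parseVerb();
        parseObject();
        accept(".");
    }

    // Subject ::= I | a Noun | the Noun
    private void parseSubject() {
        if (currentTerminal.equals("I")) {
            accept("I");
        } else if (currentTerminal.equals("a") || currentTerminal.equals("the")) {
            accept(currentTerminal);
            parseNoun();
        } else {
            throw new IllegalStateException("syntax error in Subject at " + currentTerminal);
        }
    }

    // Object ::= me | a Noun | the Noun
    private void parseObject() {
        if (currentTerminal.equals("me")) {
            accept("me");
        } else if (currentTerminal.equals("a") || currentTerminal.equals("the")) {
            accept(currentTerminal);
            parseNoun();
        } else {
            throw new IllegalStateException("syntax error in Object at " + currentTerminal);
        }
    }

    // Noun ::= cat | mat | rat
    private void parseNoun() {
        if (Arrays.asList("cat", "mat", "rat").contains(currentTerminal)) {
            accept(currentTerminal);
        } else {
            throw new IllegalStateException("syntax error in Noun at " + currentTerminal);
        }
    }

    // Verb ::= like | is | see | sees
    private void parseVerb() {
        if (Arrays.asList("like", "is", "see", "sees").contains(currentTerminal)) {
            accept(currentTerminal);
        } else {
            throw new IllegalStateException("syntax error in Verb at " + currentTerminal);
        }
    }

    // True if the whole input is exactly one Sentence.
    static boolean recognizes(String input) {
        try {
            MicroEnglishParser parser = new MicroEnglishParser(input);
            parser.parseSentence();
            return parser.currentTerminal.equals(END);
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(recognizes("the cat sees the rat ."));  // true
        System.out.println(recognizes("I sees a rat ."));          // true: the grammar ignores agreement
        System.out.println(recognizes("cat the sees ."));          // false
    }
}
```

The call structure during a successful parse mirrors the parse tree: `parseSentence` calls `parseSubject`, `parseVerb` and `parseObject`, which in turn call `parseNoun` where the grammar requires a Noun.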
60
Algorithm to convert EBNF into a RD parser
  • The conversion of an EBNF specification into a
    Java implementation for a recursive descent
    parser is so mechanical that it can easily be
    automated!
  • => JavaCC (Java Compiler Compiler)
  • We can describe the algorithm by a set of
    mechanical rewrite rules

61
Algorithm to convert EBNF into a RD parser
62
Algorithm to convert EBNF into a RD parser