Title: CSCE 330 Programming Language Structures Chapter 3: Lexical and Syntactic Analysis
- Fall 2009
- Marco Valtorta
- mgv_at_cse.sc.edu
- Syntactic sugar causes cancer of the semicolon.
- Alan Perlis
Contents
- 3.1 Chomsky Hierarchy
- 3.2 Lexical Analysis
- 3.3 Syntactic Analysis
3.1 Chomsky Hierarchy
- Regular grammar -- least powerful
- Context-free grammar (BNF)
- Context-sensitive grammar
- Unrestricted grammar
Regular Grammar
- Simplest, least powerful
- Equivalent to:
  - Regular expression
  - Finite-state automaton
- Right regular grammar (ω ∈ T*, B ∈ N):
  - A → ωB
  - A → ω
Example
- Integer → 0 Integer | 1 Integer | ... | 9 Integer
           | 0 | 1 | ... | 9
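Since regular grammars and regular expressions are equivalent, the Integer grammar above generates exactly the language of the regular expression [0-9]+. A minimal check using java.util.regex (class and method names here are illustrative, not from the slides):

```java
import java.util.regex.Pattern;

public class IntegerRegex {
    // The right-regular Integer grammar generates exactly the
    // language of the regular expression [0-9]+ (one or more digits).
    static final Pattern INTEGER = Pattern.compile("[0-9]+");

    static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isInteger("2009")); // true
        System.out.println(isInteger(""));     // false: needs at least one digit
        System.out.println(isInteger("12a"));  // false
    }
}
```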
Regular Grammars
- Left regular grammar: equivalent
- Used in construction of tokenizers (scanners, lexers)
- Less powerful than context-free grammars
- Not a regular language:
  - { aⁿbⁿ | n ≥ 1 }
  - i.e., regular grammars cannot balance nested pairs such as ( ) or begin ... end
Context-free Grammars
- BNF: a stylized form of CFG
- Equivalent to a pushdown automaton
- For a wide class of unambiguous CFGs, there are table-driven, linear-time parsers
Context-Sensitive Grammars
- Production: α → β, with |α| ≤ |β|
- α, β ∈ (N ∪ T)+
- i.e., the left-hand side can be composed of strings of terminals and nonterminals
Undecidable Properties of CSGs
- Given a string w and grammar G: is w ∈ L(G)?
- Is L(G) non-empty?
- Defn: Undecidable means that you cannot write a computer program that is guaranteed to halt and decide the question for all inputs w.
Unrestricted Grammar
- Equivalent to:
  - Turing machine
  - von Neumann machine
  - C, Java
- That is, can compute any computable function.
Contents
- 3.1 Chomsky Hierarchy
- 3.2 Lexical Analysis
- 3.3 Syntactic Analysis
Lexical Analysis
- Purpose: transform program representation
- Input: printable ASCII characters
- Output: tokens
- Discard: whitespace, comments
- Defn: A token is a logically cohesive sequence of characters representing a single symbol.
Example Tokens
- Identifiers
- Literals: 123, 5.67, 'x', true
- Keywords: bool char ...
- Operators: = + - * / ...
- Punctuation: , ( )
Other Sequences
- Whitespace: space, tab
- Comments:
  - // any-char* end-of-line
- End-of-line
- End-of-file
Why a Separate Phase?
- Simpler, faster machine model than parser
- 75% of time spent in the lexer for a non-optimizing compiler
- Differences in character sets
- End-of-line conventions differ
Regular Expressions
- RegExpr : Meaning
- x : a character x
- \x : an escaped character, e.g., \n
- { name } : a reference to a name
- M | N : M or N
- M N : M followed by N
- M* : zero or more occurrences of M
- M+ : one or more occurrences of M
- M? : zero or one occurrence of M
- [aeiou] : the set of vowels
- [0-9] : the set of digits
- . : any single character
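To make the table concrete, here is a small sketch (class and method names are mine, not the slides') mapping each operator to the equivalent java.util.regex syntax:

```java
import java.util.regex.Pattern;

public class RegexOps {
    // Java regex equivalents of the operators in the table:
    //   M | N   alternation        M N   concatenation
    //   M*      zero or more       M+    one or more
    //   M?      zero or one        [..]  character set
    //   .       any single character
    static boolean match(String regExpr, String input) {
        return Pattern.matches(regExpr, input);
    }

    public static void main(String[] args) {
        System.out.println(match("a|b", "b"));         // true: alternation
        System.out.println(match("ab*", "abbb"));      // true: zero or more b's
        System.out.println(match("[aeiou]+", "eau"));  // true: one or more vowels
        System.out.println(match("[0-9]?", ""));       // true: zero occurrences
        System.out.println(match("x.z", "xyz"));       // true: . is any character
    }
}
```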
Clite Lexical Syntax
- Category : Definition
- anyChar : any printable character
- Letter : [a-zA-Z]
- Digit : [0-9]
- Whitespace : [ \t]
- Eol : \n
- Eof : \004
- Keyword : bool | char | else | false | float | if | int | main | true | while
- Identifier : Letter ( Letter | Digit )*
- integerLit : Digit+
- floatLit : Digit+ \. Digit+
- charLit : ' anyChar '
- Operator : = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
- Separator : ; | , | { | } | ( | )
- Comment : // ( anyChar | Whitespace )* Eol
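These category definitions translate directly into regular expressions. The sketch below (patterns written by me from the definitions above) checks two of them:

```java
import java.util.regex.Pattern;

public class CliteLexemes {
    // Identifier = Letter ( Letter | Digit )*
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    // floatLit = Digit+ \. Digit+
    static final Pattern FLOAT_LIT = Pattern.compile("[0-9]+\\.[0-9]+");

    static boolean isIdentifier(String s) { return IDENTIFIER.matcher(s).matches(); }
    static boolean isFloatLit(String s)   { return FLOAT_LIT.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isIdentifier("x1"));  // true
        System.out.println(isIdentifier("2ab")); // false: must start with a Letter
        System.out.println(isFloatLit("5.67"));  // true
        System.out.println(isFloatLit("5."));    // false: needs digits after the point
    }
}
```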
Generators
- Input: usually regular expressions
- Output: table (slow) or code
- C/C++: Lex, Flex
- Java: JLex
Finite State Automata
- Set of states (represented as graph nodes)
- Input alphabet, plus a unique end symbol
- State transition function:
  - labelled (using alphabet) arcs in the graph
- Unique start state
- One or more final states
Deterministic FSA
- Defn: A finite state automaton is deterministic if for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol.
A Finite State Automaton for Identifiers [diagram not reproduced]
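A deterministic automaton like the identifier FSA can be coded directly, with one transition per (state, symbol) pair. This sketch (the state numbering is mine) accepts Letter ( Letter | Digit )*:

```java
public class IdentifierDFA {
    // States: 0 = start, 1 = in identifier (the only final state).
    // Determinism: for each state and input symbol there is at most one
    // outgoing arc, so the next state is a function of (state, symbol).
    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            boolean letter = Character.isLetter(c);
            boolean digit  = Character.isDigit(c);
            if (state == 0 && letter)                 state = 1;
            else if (state == 1 && (letter || digit)) state = 1;
            else return false;  // no outgoing arc for this symbol: reject
        }
        return state == 1;      // accept iff all input consumed in a final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("a2i")); // true
        System.out.println(accepts("2ab")); // false: cannot start with a digit
        System.out.println(accepts(""));    // false: start state is not final
    }
}
```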
Definitions
- A configuration on an FSA consists of a state and the remaining input.
- A move consists of traversing the arc exiting the state that corresponds to the leftmost input symbol, thereby consuming it. If no such arc exists, then:
  - If no input remains and the state is final, then accept.
  - Otherwise, error.
- An input is accepted if, starting with the start state, the automaton consumes all the input and halts in a final state.
Example
- (S, a2i) ⊢ (I, 2i)
-          ⊢ (I, i)
-          ⊢ (I, )
-          ⊢ (F, )
- Thus (S, a2i) ⊢* (F, )
Some Conventions
- Explicit terminator used only for the program as a whole, not for each token.
- An unlabeled arc represents any other valid input symbol.
- Recognition of a token ends in a final state.
- Recognition of a non-token transitions back to the start state.
- Recognition of the end symbol (end of file) ends in a final state.
- Automaton must be deterministic.
- Drop keywords; handle them separately.
- Must consider all sequences with a common prefix together.
Lexer Code
- Parser calls the lexer whenever it needs a new token.
- Lexer must remember where it left off.
- Greedy consumption goes one character too far:
  - peek function
  - pushback function
  - no symbol consumed by start state
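The pushback idea can be sketched with the standard java.io.PushbackReader, whose unread method returns the greedily consumed extra character to the input stream (the readDigits method name is mine):

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class GreedyScan {
    // Greedy consumption reads one character past the token; the
    // pushback buffer returns that character to the input stream.
    static String readDigits(PushbackReader in) throws IOException {
        StringBuilder token = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && Character.isDigit((char) c))
            token.append((char) c);
        if (c != -1)
            in.unread(c); // went one character too far: push it back
        return token.toString();
    }

    public static void main(String[] args) throws IOException {
        PushbackReader in = new PushbackReader(new StringReader("123+4"));
        System.out.println(readDigits(in));   // 123
        System.out.println((char) in.read()); // +  (not lost to the scanner)
    }
}
```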
From Design to Code

    private char ch;

    public Token next( ) {
        do {
            switch (ch) {
                ...
            }
        } while (true);
    }
Remarks
- Loop is only exited when a token is found.
- Loop is exited via a return statement.
- Variable ch must be global; it is initialized to a space character.
- Exact nature of a Token is irrelevant to the design.
Translation Rules
- Traversing an arc from A to B:
  - If labeled with x: test ch == x.
  - If unlabeled: else/default part of the if/switch. If it is the only arc, no test need be performed.
  - Get the next character if A is not the start state.
- A node with an arc to itself is a do-while; the condition corresponds to whichever arc is labeled.
- Otherwise the move is translated to an if/switch:
  - Each arc is a separate case.
  - An unlabeled arc is the default case.
- A sequence of transitions becomes a sequence of translated statements.
- A complex diagram is translated by boxing its components so that each box is one node. Translate each box using an outside-in strategy.
    private boolean isLetter(char c) {
        return c >= 'a' && c <= 'z'
            || c >= 'A' && c <= 'Z';
    }

    private String concat(String set) {
        StringBuffer r = new StringBuffer();
        do {
            r.append(ch);
            ch = nextChar( );
        } while (set.indexOf(ch) >= 0);
        return r.toString( );
    }
    public Token next( ) {
        do {
            if (isLetter(ch)) {           // ident or keyword
                String spelling = concat(letters + digits);
                return Token.keyword(spelling);
            } else if (isDigit(ch)) {     // int or float literal
                String number = concat(digits);
                if (ch != '.')
                    return Token.mkIntLiteral(number);
                number += concat(digits);
                return Token.mkFloatLiteral(number);
            } else switch (ch) {
                case ' ': case '\t': case '\r': case eolnCh:
                    ch = nextChar( ); break;
                case eofCh: return Token.eofTok;
                case '+': ch = nextChar( );
                    return Token.plusTok;
                ...
                case '&': check('&'); return Token.andTok;
                case '=': return chkOpt('=', Token.assignTok,
                                        Token.eqeqTok);
            }
        } while (true);
    }
Source → Tokens

Source:

    // a first program
    // with 2 comments
    int main ( ) {
        char c;
        int i;
        c = 'h';
        i = c + 3;
    } // main

Tokens:
- int
- main
- (
- )
- {
- char
- Identifier c
- ;
JLex: A Lexical Analyzer Generator for Java
We will look at an example JLex specification (adapted from the manual). Consult the manual for details on how to write your own JLex specifications.

Definition of tokens (regular expressions) → JLex → Java file: a Scanner class that recognizes the tokens
The JLex Tool
Layout of a JLex file:

    user code (added to start of generated file)
    %%
    options; user code (added inside the scanner class declaration); macro definitions
    %%
    lexical rules

- User code is copied directly into the output class.
- JLex directives allow you to include code in the lexical analysis class, change names of various components, switch on character counting and line counting, manage EOF, etc.
- Macro definitions give names to useful regexps.
- Regular expression rules define the tokens to be recognised and the actions to be taken.
java.io.StreamTokenizer
- An alternative to JLex is to use the class StreamTokenizer from java.io.
- The class recognizes 4 types of lexical elements (tokens):
  - number (sequence of decimal digits, possibly starting with a minus sign and/or containing a decimal point)
  - word (sequence of letters and digits starting with a letter)
  - line separator
  - end of file
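A minimal usage sketch (the input string is mine): with the default settings, StreamTokenizer parses numbers (including the minus sign and decimal point) and treats letters as word characters, so each call to nextToken() yields one of the four element types above:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenizerDemo {
    // Collect the token stream as readable "type:value" strings.
    static List<String> tokens(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        List<String> result = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_WORD:   result.add("word:" + st.sval);   break;
                case StreamTokenizer.TT_NUMBER: result.add("number:" + st.nval); break;
                default:                        result.add("char:" + (char) st.ttype);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokens("x1 = -3.5"));
        // → [word:x1, char:=, number:-3.5]
    }
}
```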
Parsing
- Some terminology
- Different types of parsing strategies:
  - bottom-up
  - top-down
- Recursive descent parsing:
  - What is it?
  - How to implement one given an EBNF specification
  - (How to generate one using tools: later)
  - (Bottom-up parsing algorithms)
Parsing: Some Terminology
- Recognition: to answer the question "does the input conform to the syntax of the language?"
- Parsing: recognition plus determination of phrase structure (for example, by generating AST data structures)
- (Un)ambiguous grammar: a grammar is unambiguous if there is at most one way to parse any input (i.e., for each syntactically correct program there is precisely one parse tree)
Different Kinds of Parsing Algorithms
- Two big groups of algorithms can be distinguished:
  - bottom-up strategies
  - top-down strategies
- Example: parsing of Micro-English

    Sentence ::= Subject Verb Object .
    Subject  ::= I | a Noun | the Noun
    Object   ::= me | a Noun | the Noun
    Noun     ::= cat | mat | rat
    Verb     ::= like | is | see | sees

The cat sees the rat. The rat sees me. I like a cat.
The rat like me. I see the rat. I sees a rat.
Top-down Parsing
The parse tree is constructed starting at the top (root).
[Parse-tree diagram for "The cat sees a rat." not reproduced]
Bottom-up Parsing
The parse tree grows from the bottom (leaves) up to the top (root).
[Parse-tree diagram for "The cat sees a rat." not reproduced]
Top-Down vs. Bottom-Up Parsing
- LL analysis (top-down): scans the string left to right; builds a leftmost derivation; uses look-ahead; works by derivation.
- LR analysis (bottom-up): scans the string left to right; builds a rightmost derivation (in reverse); uses look-ahead; works by reduction.
Recursive Descent Parsing
- Recursive descent parsing is a straightforward top-down parsing algorithm.
- We will now look at how to develop a recursive descent parser from an EBNF specification.
- Idea: the parse tree structure corresponds to the call-graph structure of parsing procedures that call each other recursively.
Recursive Descent Parsing

    Sentence ::= Subject Verb Object .
    Subject  ::= I | a Noun | the Noun
    Object   ::= me | a Noun | the Noun
    Noun     ::= cat | mat | rat
    Verb     ::= like | is | see | sees

Define a procedure parseN for each non-terminal N:

    private void parseSentence()
    private void parseSubject()
    private void parseObject()
    private void parseNoun()
    private void parseVerb()
Recursive Descent Parsing

    public class MicroEnglishParser {
        private TerminalSymbol currentTerminal;

        //Auxiliary methods will go here
        ...
        //Parsing methods will go here
        ...
    }
Recursive Descent Parsing: Auxiliary Methods

    public class MicroEnglishParser {
        private TerminalSymbol currentTerminal;

        private void accept(TerminalSymbol expected) {
            if (currentTerminal matches expected)
                currentTerminal = next input terminal
            else
                report a syntax error
        }
        ...
    }
Recursive Descent Parsing: Parsing Methods

    Sentence ::= Subject Verb Object .

    private void parseSentence() {
        parseSubject();
        parseVerb();
        parseObject();
        accept('.');
    }
Recursive Descent Parsing: Parsing Methods

    Subject ::= I | a Noun | the Noun

    private void parseSubject() {
        if (currentTerminal matches 'I')
            accept('I');
        else if (currentTerminal matches 'a') {
            accept('a');
            parseNoun();
        }
        else if (currentTerminal matches 'the') {
            accept('the');
            parseNoun();
        }
        else
            report a syntax error
    }
Recursive Descent Parsing: Parsing Methods

    Noun ::= cat | mat | rat

    private void parseNoun() {
        if (currentTerminal matches 'cat')
            accept('cat');
        else if (currentTerminal matches 'mat')
            accept('mat');
        else if (currentTerminal matches 'rat')
            accept('rat');
        else
            report a syntax error
    }
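Putting the pieces together, the pseudocode above can be made runnable. In this sketch the whitespace-split token array, the next/recognizes helpers, and the exception-based error reporting are my additions; the accept/parseN structure follows the slides:

```java
public class MicroEnglish {
    private final String[] tokens; // sentence pre-split into terminals
    private int pos;
    private String currentTerminal;

    private MicroEnglish(String[] tokens) { this.tokens = tokens; next(); }

    private void next() {
        currentTerminal = pos < tokens.length ? tokens[pos++] : "<eof>";
    }

    private void accept(String expected) {
        if (currentTerminal.equals(expected)) next();
        else throw new RuntimeException("syntax error: expected " + expected);
    }

    // Sentence ::= Subject Verb Object .
    private void parseSentence() {
        parseSubject(); parseVerb(); parseObject(); accept(".");
    }

    // Subject ::= I | a Noun | the Noun
    private void parseSubject() {
        if (currentTerminal.equals("I")) accept("I");
        else if (currentTerminal.equals("a")) { accept("a"); parseNoun(); }
        else if (currentTerminal.equals("the")) { accept("the"); parseNoun(); }
        else throw new RuntimeException("syntax error in Subject");
    }

    // Object ::= me | a Noun | the Noun
    private void parseObject() {
        if (currentTerminal.equals("me")) accept("me");
        else if (currentTerminal.equals("a")) { accept("a"); parseNoun(); }
        else if (currentTerminal.equals("the")) { accept("the"); parseNoun(); }
        else throw new RuntimeException("syntax error in Object");
    }

    // Noun ::= cat | mat | rat
    private void parseNoun() {
        if (currentTerminal.equals("cat") || currentTerminal.equals("mat")
                || currentTerminal.equals("rat")) next();
        else throw new RuntimeException("syntax error in Noun");
    }

    // Verb ::= like | is | see | sees
    private void parseVerb() {
        if (currentTerminal.equals("like") || currentTerminal.equals("is")
                || currentTerminal.equals("see") || currentTerminal.equals("sees")) next();
        else throw new RuntimeException("syntax error in Verb");
    }

    // Recognition: does the token sequence conform to the grammar?
    static boolean recognizes(String sentence) {
        try {
            MicroEnglish p = new MicroEnglish(sentence.split("\\s+"));
            p.parseSentence();
            return p.currentTerminal.equals("<eof>"); // all input consumed
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(recognizes("the cat sees the rat ."));  // true
        System.out.println(recognizes("I like a cat ."));          // true
        System.out.println(recognizes("the cat the ."));           // false
    }
}
```

Note how the call graph (parseSentence calling parseSubject, parseSubject calling parseNoun, ...) mirrors the parse tree, exactly as the "Idea" bullet above describes.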
Algorithm to Convert EBNF into a RD Parser
- The conversion of an EBNF specification into a Java implementation of a recursive descent parser is so mechanical that it can easily be automated!
  → JavaCC: Java Compiler Compiler
- We can describe the algorithm by a set of mechanical rewrite rules.
[Rewrite-rule tables on the following slides not reproduced]