Title: CSCE 330 Programming Language Structures Chapter 3: Lexical and Syntactic Analysis
- Fall 2009
- Marco Valtorta
- mgv_at_cse.sc.edu
- Syntactic sugar causes cancer of the semicolon.
- Alan Perlis
Contents
- 3.1 Chomsky Hierarchy
- 3.2 Lexical Analysis
- 3.3 Syntactic Analysis
3.1 Chomsky Hierarchy
- Regular grammar -- least powerful
- Context-free grammar (BNF)
- Context-sensitive grammar
- Unrestricted grammar
Regular Grammar
- Simplest, least powerful
- Equivalent to:
  - Regular expression
  - Finite-state automaton
- Right regular grammar (ω ∈ T*, B ∈ N):
  - A → ωB
  - A → ω
Example
- Integer → 0 Integer | 1 Integer | ... | 9 Integer
           | 0 | 1 | ... | 9
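Since regular grammars and regular expressions are equivalent, the Integer grammar above generates exactly the language of the regular expression [0-9]+. A minimal check using java.util.regex (class and method names here are illustrative, not from the slides):

```java
import java.util.regex.Pattern;

public class IntegerRegex {
    // The right-regular Integer grammar generates exactly the
    // language of the regular expression [0-9]+ (one or more digits).
    static final Pattern INTEGER = Pattern.compile("[0-9]+");

    static boolean isInteger(String s) {
        return INTEGER.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isInteger("2009")); // true
        System.out.println(isInteger(""));     // false: needs at least one digit
        System.out.println(isInteger("12a"));  // false
    }
}
```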
Regular Grammars
- Left regular grammar: equivalent
- Used in construction of tokenizers (scanners, lexers)
- Less powerful than context-free grammars
- Not a regular language:
  - { aⁿbⁿ | n ≥ 1 }
  - i.e., regular grammars cannot balance nested pairs such as ( ) or begin ... end
Context-free Grammars
- BNF: a stylized form of CFG
- Equivalent to a pushdown automaton
- For a wide class of unambiguous CFGs, there are table-driven, linear-time parsers
Context-Sensitive Grammars
- Production: α → β, with |α| ≤ |β|
- α, β ∈ (N ∪ T)+
- i.e., the left-hand side can be composed of strings of terminals and nonterminals
Undecidable Properties of CSGs
- Given a string w and grammar G: is w ∈ L(G)?
- Is L(G) non-empty?
- Defn: Undecidable means that you cannot write a computer program that is guaranteed to halt and decide the question for all inputs w.
Unrestricted Grammar
- Equivalent to:
  - Turing machine
  - von Neumann machine
  - C, Java
- That is, can compute any computable function.
Contents
- 3.1 Chomsky Hierarchy
- 3.2 Lexical Analysis
- 3.3 Syntactic Analysis
Lexical Analysis
- Purpose: transform program representation
- Input: printable ASCII characters
- Output: tokens
- Discard: whitespace, comments
- Defn: A token is a logically cohesive sequence of characters representing a single symbol.
Example Tokens
- Identifiers
- Literals: 123, 5.67, 'x', true
- Keywords: bool char ...
- Operators: = + - * / ...
- Punctuation: , ( )
Other Sequences
- Whitespace: space, tab
- Comments:
  - // any-char* end-of-line
- End-of-line
- End-of-file
Why a Separate Phase?
- Simpler, faster machine model than parser
- 75% of time spent in the lexer for a non-optimizing compiler
- Differences in character sets
- End-of-line conventions differ
Regular Expressions
- RegExpr : Meaning
- x : a character x
- \x : an escaped character, e.g., \n
- { name } : a reference to a name
- M | N : M or N
- M N : M followed by N
- M* : zero or more occurrences of M
- M+ : one or more occurrences of M
- M? : zero or one occurrence of M
- [aeiou] : the set of vowels
- [0-9] : the set of digits
- . : any single character
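To make the table concrete, here is a small sketch (class and method names are mine, not the slides') mapping each operator to the equivalent java.util.regex syntax:

```java
import java.util.regex.Pattern;

public class RegexOps {
    // Java regex equivalents of the operators in the table:
    //   M | N   alternation        M N   concatenation
    //   M*      zero or more       M+    one or more
    //   M?      zero or one        [..]  character set
    //   .       any single character
    static boolean match(String regExpr, String input) {
        return Pattern.matches(regExpr, input);
    }

    public static void main(String[] args) {
        System.out.println(match("a|b", "b"));         // true: alternation
        System.out.println(match("ab*", "abbb"));      // true: zero or more b's
        System.out.println(match("[aeiou]+", "eau"));  // true: one or more vowels
        System.out.println(match("[0-9]?", ""));       // true: zero occurrences
        System.out.println(match("x.z", "xyz"));       // true: . is any character
    }
}
```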
Clite Lexical Syntax
- Category : Definition
- anyChar : any printable character
- Letter : [a-zA-Z]
- Digit : [0-9]
- Whitespace : [ \t]
- Eol : \n
- Eof : \004
- Keyword : bool | char | else | false | float | if | int | main | true | while
- Identifier : Letter ( Letter | Digit )*
- integerLit : Digit+
- floatLit : Digit+ \. Digit+
- charLit : ' anyChar '
- Operator : = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ]
- Separator : ; | , | { | } | ( | )
- Comment : // ( anyChar | Whitespace )* Eol
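These category definitions translate directly into regular expressions. The sketch below (patterns written by me from the definitions above) checks two of them:

```java
import java.util.regex.Pattern;

public class CliteLexemes {
    // Identifier = Letter ( Letter | Digit )*
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    // floatLit = Digit+ \. Digit+
    static final Pattern FLOAT_LIT = Pattern.compile("[0-9]+\\.[0-9]+");

    static boolean isIdentifier(String s) { return IDENTIFIER.matcher(s).matches(); }
    static boolean isFloatLit(String s)   { return FLOAT_LIT.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isIdentifier("x1"));  // true
        System.out.println(isIdentifier("2ab")); // false: must start with a Letter
        System.out.println(isFloatLit("5.67"));  // true
        System.out.println(isFloatLit("5."));    // false: needs digits after the point
    }
}
```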
Generators
- Input: usually regular expressions
- Output: table (slow) or code
- C/C++: Lex, Flex
- Java: JLex
Finite State Automata
- Set of states (represented as graph nodes)
- Input alphabet, plus a unique end symbol
- State transition function:
  - labelled (using alphabet) arcs in the graph
- Unique start state
- One or more final states
Deterministic FSA
- Defn: A finite state automaton is deterministic if for each state and each input symbol, there is at most one outgoing arc from the state labeled with the input symbol.
A Finite State Automaton for Identifiers [diagram not reproduced]
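A deterministic automaton like the identifier FSA can be coded directly, with one transition per (state, symbol) pair. This sketch (the state numbering is mine) accepts Letter ( Letter | Digit )*:

```java
public class IdentifierDFA {
    // States: 0 = start, 1 = in identifier (the only final state).
    // Determinism: for each state and input symbol there is at most one
    // outgoing arc, so the next state is a function of (state, symbol).
    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            boolean letter = Character.isLetter(c);
            boolean digit  = Character.isDigit(c);
            if (state == 0 && letter)                 state = 1;
            else if (state == 1 && (letter || digit)) state = 1;
            else return false;  // no outgoing arc for this symbol: reject
        }
        return state == 1;      // accept iff all input consumed in a final state
    }

    public static void main(String[] args) {
        System.out.println(accepts("a2i")); // true
        System.out.println(accepts("2ab")); // false: cannot start with a digit
        System.out.println(accepts(""));    // false: start state is not final
    }
}
```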
Definitions
- A configuration on an FSA consists of a state and the remaining input.
- A move consists of traversing the arc exiting the state that corresponds to the leftmost input symbol, thereby consuming it. If no such arc exists, then:
  - If no input remains and the state is final, then accept.
  - Otherwise, error.
- An input is accepted if, starting with the start state, the automaton consumes all the input and halts in a final state.
Example
- (S, a2i) ⊢ (I, 2i)
-          ⊢ (I, i)
-          ⊢ (I, )
-          ⊢ (F, )
- Thus (S, a2i) ⊢* (F, )
Some Conventions
- Explicit terminator used only for the program as a whole, not for each token.
- An unlabeled arc represents any other valid input symbol.
- Recognition of a token ends in a final state.
- Recognition of a non-token transitions back to the start state.
- Recognition of the end symbol (end of file) ends in a final state.
- Automaton must be deterministic.
- Drop keywords; handle them separately.
- Must consider all sequences with a common prefix together.
Lexer Code
- Parser calls the lexer whenever it needs a new token.
- Lexer must remember where it left off.
- Greedy consumption goes one character too far:
  - peek function
  - pushback function
  - no symbol consumed by start state
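The pushback idea can be sketched with the standard java.io.PushbackReader, whose unread method returns the greedily consumed extra character to the input stream (the readDigits method name is mine):

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

public class GreedyScan {
    // Greedy consumption reads one character past the token; the
    // pushback buffer returns that character to the input stream.
    static String readDigits(PushbackReader in) throws IOException {
        StringBuilder token = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && Character.isDigit((char) c))
            token.append((char) c);
        if (c != -1)
            in.unread(c); // went one character too far: push it back
        return token.toString();
    }

    public static void main(String[] args) throws IOException {
        PushbackReader in = new PushbackReader(new StringReader("123+4"));
        System.out.println(readDigits(in));   // 123
        System.out.println((char) in.read()); // +  (not lost to the scanner)
    }
}
```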
From Design to Code

    private char ch;

    public Token next( ) {
        do {
            switch (ch) {
                ...
            }
        } while (true);
    }
Remarks
- Loop is only exited when a token is found.
- Loop is exited via a return statement.
- Variable ch must be global; it is initialized to a space character.
- Exact nature of a Token is irrelevant to the design.
Translation Rules
- Traversing an arc from A to B:
  - If labeled with x: test ch == x.
  - If unlabeled: else/default part of the if/switch. If it is the only arc, no test need be performed.
  - Get the next character if A is not the start state.
- A node with an arc to itself is a do-while; the condition corresponds to whichever arc is labeled.
- Otherwise the move is translated to an if/switch:
  - Each arc is a separate case.
  - An unlabeled arc is the default case.
- A sequence of transitions becomes a sequence of translated statements.
- A complex diagram is translated by boxing its components so that each box is one node. Translate each box using an outside-in strategy.
    private boolean isLetter(char c) {
        return c >= 'a' && c <= 'z'
            || c >= 'A' && c <= 'Z';
    }

    private String concat(String set) {
        StringBuffer r = new StringBuffer();
        do {
            r.append(ch);
            ch = nextChar( );
        } while (set.indexOf(ch) >= 0);
        return r.toString( );
    }
    public Token next( ) {
        do {
            if (isLetter(ch)) {           // ident or keyword
                String spelling = concat(letters + digits);
                return Token.keyword(spelling);
            } else if (isDigit(ch)) {     // int or float literal
                String number = concat(digits);
                if (ch != '.')
                    return Token.mkIntLiteral(number);
                number += concat(digits);
                return Token.mkFloatLiteral(number);
            } else switch (ch) {
                case ' ': case '\t': case '\r': case eolnCh:
                    ch = nextChar( ); break;
                case eofCh: return Token.eofTok;
                case '+': ch = nextChar( );
                    return Token.plusTok;
                ...
                case '&': check('&'); return Token.andTok;
                case '=': return chkOpt('=', Token.assignTok,
                                        Token.eqeqTok);
            }
        } while (true);
    }
Source → Tokens

Source:

    // a first program
    // with 2 comments
    int main ( ) {
        char c;
        int i;
        c = 'h';
        i = c + 3;
    } // main

Tokens:
- int
- main
- (
- )
- {
- char
- Identifier c
- ;
JLex: A Lexical Analyzer Generator for Java
We will look at an example JLex specification (adapted from the manual). Consult the manual for details on how to write your own JLex specifications.

Definition of tokens (regular expressions) → JLex → Java file: a Scanner class that recognizes the tokens
The JLex Tool
Layout of a JLex file:

    user code (added to start of generated file)
    %%
    options; user code (added inside the scanner class declaration); macro definitions
    %%
    lexical rules

- User code is copied directly into the output class.
- JLex directives allow you to include code in the lexical analysis class, change names of various components, switch on character counting and line counting, manage EOF, etc.
- Macro definitions give names to useful regexps.
- Regular expression rules define the tokens to be recognised and the actions to be taken.
java.io.StreamTokenizer
- An alternative to JLex is to use the class StreamTokenizer from java.io.
- The class recognizes 4 types of lexical elements (tokens):
  - number (sequence of decimal digits, possibly starting with a minus sign and/or containing a decimal point)
  - word (sequence of letters and digits starting with a letter)
  - line separator
  - end of file
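A minimal usage sketch (the input string is mine): with the default settings, StreamTokenizer parses numbers (including the minus sign and decimal point) and treats letters as word characters, so each call to nextToken() yields one of the four element types above:

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class TokenizerDemo {
    // Collect the token stream as readable "type:value" strings.
    static List<String> tokens(String source) throws IOException {
        StreamTokenizer st = new StreamTokenizer(new StringReader(source));
        List<String> result = new ArrayList<>();
        while (st.nextToken() != StreamTokenizer.TT_EOF) {
            switch (st.ttype) {
                case StreamTokenizer.TT_WORD:   result.add("word:" + st.sval);   break;
                case StreamTokenizer.TT_NUMBER: result.add("number:" + st.nval); break;
                default:                        result.add("char:" + (char) st.ttype);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokens("x1 = -3.5"));
        // → [word:x1, char:=, number:-3.5]
    }
}
```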
Parsing
- Some terminology
- Different types of parsing strategies:
  - bottom-up
  - top-down
- Recursive descent parsing:
  - What is it?
  - How to implement one given an EBNF specification
  - (How to generate one using tools: later)
  - (Bottom-up parsing algorithms)
Parsing: Some Terminology
- Recognition: to answer the question "does the input conform to the syntax of the language?"
- Parsing: recognition plus determination of phrase structure (for example, by generating AST data structures)
- (Un)ambiguous grammar: a grammar is unambiguous if there is at most one way to parse any input (i.e., for each syntactically correct program there is precisely one parse tree)
Different Kinds of Parsing Algorithms
- Two big groups of algorithms can be distinguished:
  - bottom-up strategies
  - top-down strategies
- Example: parsing of Micro-English

    Sentence ::= Subject Verb Object .
    Subject  ::= I | a Noun | the Noun
    Object   ::= me | a Noun | the Noun
    Noun     ::= cat | mat | rat
    Verb     ::= like | is | see | sees

The cat sees the rat. The rat sees me. I like a cat.
The rat like me. I see the rat. I sees a rat.
Top-down Parsing
The parse tree is constructed starting at the top (root).
[Parse-tree diagram for "The cat sees a rat." not reproduced]
Bottom-up Parsing
The parse tree grows from the bottom (leaves) up to the top (root).
[Parse-tree diagram for "The cat sees a rat." not reproduced]
Top-Down vs. Bottom-Up Parsing
- LL analysis (top-down): scans the string left to right; builds a leftmost derivation; uses look-ahead; works by derivation.
- LR analysis (bottom-up): scans the string left to right; builds a rightmost derivation (in reverse); uses look-ahead; works by reduction.
Recursive Descent Parsing
- Recursive descent parsing is a straightforward top-down parsing algorithm.
- We will now look at how to develop a recursive descent parser from an EBNF specification.
- Idea: the parse tree structure corresponds to the call-graph structure of parsing procedures that call each other recursively.
Recursive Descent Parsing

    Sentence ::= Subject Verb Object .
    Subject  ::= I | a Noun | the Noun
    Object   ::= me | a Noun | the Noun
    Noun     ::= cat | mat | rat
    Verb     ::= like | is | see | sees

Define a procedure parseN for each non-terminal N:

    private void parseSentence()
    private void parseSubject()
    private void parseObject()
    private void parseNoun()
    private void parseVerb()
Recursive Descent Parsing

    public class MicroEnglishParser {
        private TerminalSymbol currentTerminal;

        //Auxiliary methods will go here
        ...
        //Parsing methods will go here
        ...
    }
Recursive Descent Parsing: Auxiliary Methods

    public class MicroEnglishParser {
        private TerminalSymbol currentTerminal;

        private void accept(TerminalSymbol expected) {
            if (currentTerminal matches expected)
                currentTerminal = next input terminal
            else
                report a syntax error
        }
        ...
    }
Recursive Descent Parsing: Parsing Methods

    Sentence ::= Subject Verb Object .

    private void parseSentence() {
        parseSubject();
        parseVerb();
        parseObject();
        accept('.');
    }
Recursive Descent Parsing: Parsing Methods

    Subject ::= I | a Noun | the Noun

    private void parseSubject() {
        if (currentTerminal matches 'I')
            accept('I');
        else if (currentTerminal matches 'a') {
            accept('a');
            parseNoun();
        }
        else if (currentTerminal matches 'the') {
            accept('the');
            parseNoun();
        }
        else
            report a syntax error
    }
Recursive Descent Parsing: Parsing Methods

    Noun ::= cat | mat | rat

    private void parseNoun() {
        if (currentTerminal matches 'cat')
            accept('cat');
        else if (currentTerminal matches 'mat')
            accept('mat');
        else if (currentTerminal matches 'rat')
            accept('rat');
        else
            report a syntax error
    }
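Putting the pieces together, the pseudocode above can be made runnable. In this sketch the whitespace-split token array, the next/recognizes helpers, and the exception-based error reporting are my additions; the accept/parseN structure follows the slides:

```java
public class MicroEnglish {
    private final String[] tokens; // sentence pre-split into terminals
    private int pos;
    private String currentTerminal;

    private MicroEnglish(String[] tokens) { this.tokens = tokens; next(); }

    private void next() {
        currentTerminal = pos < tokens.length ? tokens[pos++] : "<eof>";
    }

    private void accept(String expected) {
        if (currentTerminal.equals(expected)) next();
        else throw new RuntimeException("syntax error: expected " + expected);
    }

    // Sentence ::= Subject Verb Object .
    private void parseSentence() {
        parseSubject(); parseVerb(); parseObject(); accept(".");
    }

    // Subject ::= I | a Noun | the Noun
    private void parseSubject() {
        if (currentTerminal.equals("I")) accept("I");
        else if (currentTerminal.equals("a")) { accept("a"); parseNoun(); }
        else if (currentTerminal.equals("the")) { accept("the"); parseNoun(); }
        else throw new RuntimeException("syntax error in Subject");
    }

    // Object ::= me | a Noun | the Noun
    private void parseObject() {
        if (currentTerminal.equals("me")) accept("me");
        else if (currentTerminal.equals("a")) { accept("a"); parseNoun(); }
        else if (currentTerminal.equals("the")) { accept("the"); parseNoun(); }
        else throw new RuntimeException("syntax error in Object");
    }

    // Noun ::= cat | mat | rat
    private void parseNoun() {
        if (currentTerminal.equals("cat") || currentTerminal.equals("mat")
                || currentTerminal.equals("rat")) next();
        else throw new RuntimeException("syntax error in Noun");
    }

    // Verb ::= like | is | see | sees
    private void parseVerb() {
        if (currentTerminal.equals("like") || currentTerminal.equals("is")
                || currentTerminal.equals("see") || currentTerminal.equals("sees")) next();
        else throw new RuntimeException("syntax error in Verb");
    }

    // Recognition: does the token sequence conform to the grammar?
    static boolean recognizes(String sentence) {
        try {
            MicroEnglish p = new MicroEnglish(sentence.split("\\s+"));
            p.parseSentence();
            return p.currentTerminal.equals("<eof>"); // all input consumed
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(recognizes("the cat sees the rat ."));  // true
        System.out.println(recognizes("I like a cat ."));          // true
        System.out.println(recognizes("the cat the ."));           // false
    }
}
```

Note how the call graph (parseSentence calling parseSubject, parseSubject calling parseNoun, ...) mirrors the parse tree, exactly as the "Idea" bullet above describes.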
Algorithm to Convert EBNF into a RD Parser
- The conversion of an EBNF specification into a Java implementation of a recursive descent parser is so mechanical that it can easily be automated!
  → JavaCC: Java Compiler Compiler
- We can describe the algorithm by a set of mechanical rewrite rules.
[Rewrite-rule tables on the following slides not reproduced]