Foundations of Software Design - PowerPoint PPT Presentation

Title: A Skill Is Born: The Emergence of Web Site Design Skills (1994-1998) Last modified by: hearst Created Date: 7/19/2001 7:37:29 AM Document presentation format – PowerPoint PPT presentation

1
Foundations of Software Design
Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs
Marti Hearst, Fall 2002
2
How Do Computers Work (Revisited)?
Machine Instructions
Bits & Bytes
Binary Numbers
3
The Compiler
  • What is a compiler?
  • A recognizer (of some source language L).
  • A translator (of programs written in L into
    programs written in some object or target
    language L').
  • A compiler is itself a program, written in some
    host language
  • Operates in phases

Programming Languages
Assembly Language
Machine Instructions
4
Converting Java to Byte Code
  • When you compile a Java program, javac produces
    byte codes (stored in the class file).
  • javac does not convert the byte codes to machine
    code.
  • Instead, they are interpreted in the VM when you
    run the program called java.

5
C code
Translated by the C compiler (gcc or cc)
Assembly Language
Creates the JVM once
Machine Code
Java code
Translated by the java compiler (javac or jit)
Java Virtual Machine
Byte code (class file)
Individual program is loaded and run in JVM
6
Compiler Compilers
  • Which came first, the compiler or the program?
  • The very first one has to be written in assembly
    language!
  • This is why most programming languages today
    start with the C code generator.
  • After you have created the first compiler for a
    given language, say Java, you can use that
    compiler to compile itself!

7
Compiling Your Compiler
Write the first Java compiler using C → Compile using gcc → javac in C
Write the second Java compiler using Java → Compile using javac → javac in Java
Write other Java programs → Compile using javac
8
Compiler in more detail.
Lexical analyzer (scanner)
Syntax analyzer (parser)
Semantic analyzer
Intermediate Code Generator
Optimizer
Code Generator
9
The Scanner
  • Task
  • Translate the sequence of characters into a
    corresponding sequence of tokens (by grouping
    characters into lexemes).
  • How it's done
  • Specify lexemes using Regular Expressions
  • Convert these Regular Expressions into Finite
    Automata

10
Lexemes and Tokens
  • Here are some Java lexemes and the corresponding
    tokens:
  • Lexemes: ;  =  index  tmp  37  102
  • Tokens: SEMI-COLON ASSIGN IDENT IDENT INT-LIT INT-LIT
  • Note that multiple lexemes can correspond to the
    same token (e.g., there are many identifiers).
  • Given the source code
  • position = initial + rate * 60;
  • a Java scanner would return the following
    sequence of tokens
  • IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT
    SEMI-COLON
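
As a rough illustration of how regular expressions can drive a scanner, here is a minimal sketch that turns the assignment statement from this slide into the token sequence listed. The class name SketchScanner, the regex group names, and the handful of token kinds are assumptions made for this example; a real Java scanner recognizes many more lexemes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SketchScanner {
    // One named group per token kind (hypothetical, covers only this slide's tokens).
    private static final Pattern TOKEN = Pattern.compile(
        "(?<IDENT>[A-Za-z_][A-Za-z_0-9]*)"
        + "|(?<INTLIT>0|[1-9][0-9]*)"
        + "|(?<ASSIGN>=)"
        + "|(?<PLUS>\\+)"
        + "|(?<TIMES>\\*)"
        + "|(?<SEMICOLON>;)");

    // Maps each regex group name to the token name used on the slide.
    private static final String[][] KINDS = {
        {"IDENT", "IDENT"}, {"INTLIT", "INT-LIT"}, {"ASSIGN", "ASSIGN"},
        {"PLUS", "PLUS"}, {"TIMES", "TIMES"}, {"SEMICOLON", "SEMI-COLON"}
    };

    static List<String> tokenize(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        int pos = 0;
        while (true) {
            // Skip whitespace between lexemes.
            while (pos < source.length() && Character.isWhitespace(source.charAt(pos))) {
                pos++;
            }
            if (pos == source.length()) {
                break;
            }
            m.region(pos, source.length());
            if (!m.lookingAt()) {
                // No token pattern matches here: a lexical error.
                throw new IllegalArgumentException(
                    "Lexical error at offset " + pos + ": " + source.charAt(pos));
            }
            for (String[] kind : KINDS) {
                if (m.group(kind[0]) != null) {
                    tokens.add(kind[1]);
                    break;
                }
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("position = initial + rate * 60;"));
    }
}
```

The scanner only groups characters into tokens; it accepts "position = * 5;" without complaint, because spotting that bad ordering is the parser's job.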

11
The Scanner
  • Also called the Lexer
  • How it works
  • Reads characters from the source program.
  • Groups the characters into lexemes (sequences of
    characters that "go together").
  • Each lexeme corresponds to a token
  • the scanner returns the next token (plus maybe
    some additional information) to the parser.
  • The scanner may also discover lexical errors
    (e.g., erroneous characters).
  • The definitions of what is a lexeme, token, or
    bad character all depend on the source language.

12
Two kinds of Automata
  • Deterministic (DFA)
  • No state has more than one outgoing edge with the
    same label.
  • Non-Deterministic (NFA)
  • States may have more than one outgoing edge with
    same label.
  • Edges may be labeled with ε (epsilon), the empty
    string.
  • The automaton can take an ε transition
    without looking at the current input character.

13
Regular Expressions to Finite Automata
  • Generating a scanner

Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA
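
To make "table-driven implementation" a little more concrete, here is a small sketch of a DFA stored as a transition table. It recognizes a simplified, ASCII-only identifier (a letter, underscore, or dollar sign followed by letters or digits). The class name, state names, and character classes are assumptions for illustration, not output from any real scanner generator.

```java
class SketchDfa {
    // Character classes (columns of the table).
    private static final int LETTER = 0, DIGIT = 1, OTHER = 2;
    // States (rows of the table).
    private static final int START = 0, IN_IDENT = 1, DEAD = 2;

    // NEXT[state][characterClass] gives the next state. Each state has exactly
    // one outgoing edge per character class, which is what makes it deterministic.
    private static final int[][] NEXT = {
        /* START    */ {IN_IDENT, DEAD,     DEAD},
        /* IN_IDENT */ {IN_IDENT, IN_IDENT, DEAD},
        /* DEAD     */ {DEAD,     DEAD,     DEAD},
    };

    private static int classify(char c) {
        if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == '$') {
            return LETTER;
        }
        if (c >= '0' && c <= '9') {
            return DIGIT;
        }
        return OTHER;
    }

    // Runs the automaton over the whole input; IN_IDENT is the only accepting state.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][classify(input.charAt(i))];
        }
        return state == IN_IDENT;
    }

    public static void main(String[] args) {
        System.out.println(accepts("index2"));
        System.out.println(accepts("2index"));
    }
}
```

The driver loop never changes; only the table does, which is why a generator can emit the table from the regular expressions.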
14
BNF
  • Backus-Naur Form (also called Backus Normal Form)
  • A set of rules (or productions)
  • Each of which expresses the ways symbols of the
    language can be grouped together
  • Non-terminals are written upper-case
  • Terminals are written lower-case
  • The start symbol is the left-hand side of the
    first production
  • The rules for a CFG are often referred to as its
    BNF

15
Java Identifier Definition
  • Described in the Java specification
  • http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#44591
  • An identifier is an unlimited-length sequence of
    Java letters and Java digits, the first of which
    must be a Java letter.
  • An identifier cannot have the same spelling
    (Unicode character sequence) as a keyword (§3.9),
    Boolean literal (§3.10.3), or the null literal
    (§3.10.7).
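
The definition above can be approximated with the standard library's Character.isJavaIdentifierStart and Character.isJavaIdentifierPart methods. This is only a sketch: the class name is made up, the RESERVED set is a deliberately partial stand-in for the full keyword list in the specification, and characters outside the Basic Multilingual Plane are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SketchIdentifier {
    // Partial list for illustration only; the specification lists every keyword.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "class", "int", "for", "if", "while", "return", "true", "false", "null"));

    static boolean isIdentifier(String s) {
        // Must be non-empty and must not be spelled like a keyword or literal.
        if (s.isEmpty() || RESERVED.contains(s)) {
            return false;
        }
        // The first character must be a "Java letter".
        if (!Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        // Every later character must be a "Java letter" or "Java digit".
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("tmp"));
        System.out.println(isIdentifier("null"));
    }
}
```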

16
Java Identifier Definition

17
Java Integer Literals
  • An integer literal may be expressed in decimal
    (base 10), hexadecimal (base 16), or octal (base
    8)
  • Examples
  • 0 2 0372 0xDadaCafe 1996 0x00FF00FF

(opt means optional)
18
Defining Java Decimal Numerals
  • A decimal numeral is either the single ASCII
    character 0, representing the integer zero, or
    consists of an ASCII digit from 1 to 9,
    optionally followed by one or more ASCII digits
    from 0 to 9, representing a positive integer
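
The prose definition of a decimal numeral translates almost word for word into a regular expression. The sketch below, with an assumed class and method name, checks only decimal numerals; hexadecimal and octal literals and any type suffix are left out.

```java
import java.util.regex.Pattern;

class SketchDecimalNumeral {
    // Either the single digit 0, or a digit 1-9 followed by any number of digits 0-9.
    private static final Pattern DECIMAL = Pattern.compile("0|[1-9][0-9]*");

    static boolean isDecimalNumeral(String s) {
        // matches() requires the whole string to fit the pattern.
        return DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDecimalNumeral("1996"));
        System.out.println(isDecimalNumeral("0372"));
    }
}
```

Note that 0372 is rejected here: with a leading zero it is an octal literal, not a decimal numeral.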

19
Defining Floating-Point Literals
  • A floating-point literal has the following
    parts: a whole-number part, a decimal point
    (represented by an ASCII period character), a
    fractional part, an exponent, and a type suffix.
    The exponent, if present, is indicated by the
    ASCII letter e or E followed by an optionally
    signed integer.

20
From the Lucene HTML Scanner
21
The Functionality of the Parser
  • Input: sequence of tokens from lexical analysis
  • Output: parse tree of the program
  • A parse tree is generated if the input is a legal
    program
  • If the input is an illegal program, syntax errors
    are issued
  • Note:
  • Instead of a parse tree, some parsers directly
    produce
  • an abstract syntax tree (AST) plus symbol table, or
  • intermediate code, or
  • object code

22
Parser vs. Scanner
Phase    | Input                | Output
Scanner  | String of characters | String of tokens
Parser   | String of tokens     | Parse tree
23
The Parser
  • Groups tokens into "grammatical phrases",
    discovering the underlying structure of the
    source program.
  • Finds syntax errors.
  • Example:
  • position = * 5;
  • corresponds to the sequence of tokens
  • IDENT ASSIGN TIMES INT-LIT SEMI-COLON
  • All are legal tokens, but that sequence of tokens
    is erroneous.
  • Might find some "static semantic" errors, e.g., a
    use of an undeclared variable, or variables that
    are multiply declared.
  • Might generate code, or build some intermediate
    representation of the program such as an
    abstract-syntax tree.

24
What must the parser do?
  • Recognizer: not all strings of tokens are
    programs
  • must distinguish between valid and invalid
    strings of tokens
  • Translator: must expose program structure
  • e.g., associativity and precedence
  • must return the parse tree
  • We need:
  • A language for describing valid strings of tokens
  • context-free grammars
  • (analogous to regular expressions in the scanner)
  • A method for distinguishing valid from invalid
    strings of tokens (and for building the parse
    tree)
  • the parser
  • (analogous to the state machine in the scanner)

25
Parser Example
  • position = initial + rate * 60;

=
    position
    +
        initial
        *
            rate
            60
26
The Semantic Analyzer
  • The semantic analyzer checks for (more) "static
    semantic" errors, e.g., type errors.
  • Annotates and/or changes the abstract syntax tree
  • (e.g., it might annotate each node that
    represents an expression with its type).
  • Example with before and after

= (float)
    position (float)
    + (float)
        initial (float)
        * (float)
            rate (float)
            int-to-float() (float)
                60 (int)
27
Intermediate Code Generator
  • The intermediate code generator translates from
    abstract-syntax tree to intermediate code.
  • One possibility is 3-address code.
  • Here's an example of 3-address code for the
    abstract-syntax tree shown above
  • temp1 = int-to-float(60)
  • temp2 = rate * temp1
  • temp3 = initial + temp2
  • position = temp3
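
A translation like this can be sketched as a post-order walk over the tree: generate code for the children first, then emit one instruction that stores the node's result in a fresh temporary. The Node class, the temp naming scheme, and the example() helper below are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SketchThreeAddress {
    // Minimal AST node (hypothetical): a leaf has no children, a unary node
    // (such as a conversion call) has only a left child, a binary node has both.
    static final class Node {
        final String label;
        final Node left;
        final Node right;

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }

        static Node leaf(String name) {
            return new Node(name, null, null);
        }
    }

    private final List<String> code = new ArrayList<>();
    private int tempCount = 0;

    private String newTemp() {
        tempCount++;
        return "temp" + tempCount;
    }

    // Emits code for the subtree and returns the name that holds its value.
    private String emit(Node n) {
        if (n.left == null && n.right == null) {
            return n.label; // variable or constant: no code needed
        }
        if (n.right == null) {
            String arg = emit(n.left);
            String t = newTemp();
            code.add(t + " = " + n.label + "(" + arg + ")");
            return t;
        }
        String l = emit(n.left);
        String r = emit(n.right);
        String t = newTemp();
        code.add(t + " = " + l + " " + n.label + " " + r);
        return t;
    }

    // Expects an assignment node whose left child is the target variable.
    static List<String> generate(Node assignment) {
        SketchThreeAddress g = new SketchThreeAddress();
        String value = g.emit(assignment.right);
        g.code.add(assignment.left.label + " = " + value);
        return g.code;
    }

    // The annotated tree for: position = initial + rate * int-to-float(60)
    static Node example() {
        Node convert = new Node("int-to-float", Node.leaf("60"), null);
        Node times = new Node("*", Node.leaf("rate"), convert);
        Node plus = new Node("+", Node.leaf("initial"), times);
        return new Node("=", Node.leaf("position"), plus);
    }

    public static void main(String[] args) {
        for (String line : generate(example())) {
            System.out.println(line);
        }
    }
}
```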

28
The Optimizer
int count = 0;
for (int j = 0; j < 2*5; j++) {
    int temp = j + 1;
    count += 3;
}
  • Examine the program and rewrite it in ways that
    preserve the meaning but are more efficient.
  • Incredibly complex programs and algorithms
  • Example
  • Move the declaration of temp outside the loop so
    it isn't re-declared every time the loop is
    executed
  • Change 2*5 to 10 since it is a constant (no need
    to do an expensive multiply at run time)
  • If we removed the line with temp, the program
    might even skip the loop altogether
  • You can see in advance that count ends up 30
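
The effect of these rewrites can be checked by hand. The sketch below assumes the slide's loop runs while j < 2*5, declares an unused temp (its exact initializer is a guess here), and adds 3 to count on each pass. The optimized method shows what is left after constant folding, removing the unused variable, and evaluating the loop ahead of time. It is a hand-written before and after, not the output of any particular optimizer.

```java
class SketchOptimizer {
    // The loop roughly as written on the slide (temp's initializer is assumed).
    static int original() {
        int count = 0;
        for (int j = 0; j < 2 * 5; j++) {
            int temp = j + 1; // never used afterwards, so it is dead code
            count += 3;
        }
        return count;
    }

    // After folding 2 * 5 to 10 and dropping temp, the loop just adds 3 ten
    // times, so the whole computation can be replaced by its result.
    static int optimized() {
        return 30;
    }

    public static void main(String[] args) {
        System.out.println(original() == optimized());
    }
}
```

Both methods return the same value, which is the requirement the first bullet states: the rewrite must preserve the program's meaning.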

29
The Code Generator
  • The code generator generates object code from
    (optimized) intermediate code.
  • LOADF rate,R1
  • MULF 60.0,R1
  • LOADF initial,R2
  • ADDF R2,R1
  • STOREF R1,position

30
Tools
  • Scanner Generator
  • Used to create a scanner automatically
  • Input
  • a regular expression for each token to be
    recognized
  • Output
  • a finite state machine
  • Examples
  • lex or flex (produce C code), or JLex (produces
    Java)
  • Compiler Compilers
  • yacc (produces C) or JavaCC (produces Java, also
    has a scanner generator).

31
From the Lucene HTML Parser
32
From the Lucene HTML Parser
33
Graphs / Networks
34
What is a Graph?
47
Next Time
  • Graph Traversal
  • Directed Graphs (digraphs)
  • DAGS
  • Weighted Graphs