Compilers: Principles, Techniques, and Tools

Author: shin. Last modified by: shin. Created: 6/2/1995 9:27:28 PM.

1
Compilers: Principles, Techniques, and Tools
  • Jing-Shin Chang
  • Department of Computer Science & Information
    Engineering
  • National Chi-Nan University

2
Goals
  • What is a Compiler? Why? Applications?
  • How to Write a Compiler by Hand?
  • Theories and Principles behind compiler
    construction: Parsing, Translation & Compiling
  • Techniques for Efficient Parsing
  • How to Write a Compiler with Tools

3
Table of Contents
  • 1. Introduction: What, Why & Apps
  • 2. How: A Simple Compiler
  • - What is a Better & Typical Compiler
  • 3. Lexical Analysis
  • - Regular Expression and Scanner
  • 4. Syntax Analysis
  • - Grammars and Parsing
  • 5. Top-Down Parsing: LL(1)
  • 6. Bottom-Up Parsing: LR(1)

4
Table of Contents
  • 7. Syntax-Directed Translation
  • 8. Semantic Processing
  • 9. Symbol Tables
  • 10. Run-time Storage Organization

5
Table of Contents
  • 11. Translation of Special Structures
  • - Modular Program Structures
  • - Declarations
  • - Expressions and Data Structure References
  • - Control Structures
  • - Procedures and Functions
  • 12. General Translation Scheme
  • - Attribute Grammars

6
Table of Contents
  • 13. Code Generation
  • 14. Global Optimization
  • 15. Tools: Compiler Compiler

7
What is A Compiler?
  • Functional blocks
  • Forms of compilers

8
The Compiler
  • What is a compiler?
  • A program for translating programming languages
    into machine languages
  • source language => target language
  • Why compilers?
  • Filling the gaps between a programmer and the
    computer hardware

9
Compiler: A Bridge Between PL and Hardware
Applications (High Level Language)
A = B + C * D
Compiler
Operating System
Assembly Codes:
MOV A, C
MUL A, D
ADD A, B
MOV va, A
Hardware (Low Level Language)
Register-based or Stack-based machines
10
Typical Machine Instructions: Register-based
Machines
Registers: A, B, C, D, E, H, L
  • Data Transfer
  • MOV A, B
  • MOV A, mem
  • More: IN/OUT, Push, Pop, ...
  • Arithmetic Operation
  • ADD A, C // A = A + C
  • MUL A, D // A = A * D
  • More: ADC, SUB, SBB, INC
  • Logical Operation
  • AND A, 00001111B // A = A AND 00001111B
  • More: OR, NOT, XOR, Shift, Rotate
  • Program Control
  • JMP, JZ, JNZ, Call, ...
  • Low-Level Instruction Features
  • Mostly simple binary operators (using source &
    target operands)

Registers of an Intel 8085 processor
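The two-operand style above can be tried out with a small simulator. This is only a minimal sketch in Python, not a model of a real 8085: the function name `run_register_machine`, the memory name `va`, and the initial register values are assumptions made for illustration, and only the MOV, ADD and MUL forms shown on the slide are handled.

```python
def run_register_machine(program, registers, memory):
    """Execute MOV/ADD/MUL lines; returns the final (registers, memory)."""
    regs = dict.fromkeys("ABCDEHL", 0)   # register names taken from the slide
    regs.update(registers)
    mem = dict(memory)

    def read(operand):
        # An operand is either a register name or a named memory cell.
        if operand in regs:
            return regs[operand]
        return mem[operand]

    for line in program:
        opcode, operands = line.split(None, 1)
        dst, src = (part.strip() for part in operands.split(","))
        if opcode == "MOV":
            value = read(src)
        elif opcode == "ADD":
            value = read(dst) + read(src)   # destination is also a source
        elif opcode == "MUL":
            value = read(dst) * read(src)
        else:
            raise ValueError(f"unsupported opcode: {opcode}")
        if dst in regs:
            regs[dst] = value
        else:
            mem[dst] = value
    return regs, mem


# The assembly sequence shown for A = B + C * D.
PROGRAM = ["MOV A, C", "MUL A, D", "ADD A, B", "MOV va, A"]
```

Calling `run_register_machine(PROGRAM, {"B": 2, "C": 3, "D": 4}, {})` leaves 3 * 4 + 2 in the memory cell `va`, which illustrates why each binary instruction shares one operand as both source and destination.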
11
Typical Machine Instructions: Stack-based
Machines
  • Data Transfer
  • Push A // SP++; (SP) = A
  • Push mem // SP++; (SP) = mem
  • Dup // (SP+1) = (SP); SP++
  • Pop mem // mem = (SP); SP--
  • Arithmetic Operation
  • ADD // (SP-1) = (SP) + (SP-1); SP--
  • MUL // (SP-1) = (SP) x (SP-1); SP--
  • Logical Operation
  • Program Control
  • Low Level Instructions Features
  • Mostly Simple Binary Operators
  • Operations are applied to the topmost 2 source
    operands
  • return results to new stack top (destination
    operand)
  • Almost no general purpose registers
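The stack discipline described above can be sketched in a few lines of Python. This is an illustrative toy, not any real instruction set: the function name `run_stack_machine` and the variable names in `memory` are assumptions, and a Python list stands in for the hardware stack (the list end plays the role of SP).

```python
def run_stack_machine(program, memory):
    """Execute Push/Pop/Dup/ADD/MUL lines; returns the final memory."""
    mem = dict(memory)
    stack = []
    for line in program:
        parts = line.split()
        op = parts[0].upper()
        if op == "PUSH":
            stack.append(mem[parts[1]])      # SP++; (SP) = mem
        elif op == "POP":
            mem[parts[1]] = stack.pop()      # mem = (SP); SP--
        elif op == "DUP":
            stack.append(stack[-1])          # copy the top of the stack
        elif op == "ADD":
            top = stack.pop()
            stack[-1] = top + stack[-1]      # (SP-1) = (SP) + (SP-1); SP--
        elif op == "MUL":
            top = stack.pop()
            stack[-1] = top * stack[-1]      # (SP-1) = (SP) x (SP-1); SP--
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return mem


# One possible stack code sequence for A = B + C * D.
PROGRAM = ["Push B", "Push C", "Push D", "MUL", "ADD", "Pop A"]
```

Note that no operand names appear on ADD or MUL: both sources are implicit (the two topmost stack entries), which is why such machines need almost no general purpose registers.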

12
Compiler (1) - Compilation
Source Program/Code (P.L., Formal Spec.): A = B + C * D
Compiler (reports Error Messages)
Target Program/Code (P.L., Assembly, Machine Code):
MOV A, C
MUL A, D
ADD A, B
MOV va, A
13
Machine Independent Intermediate Instructions
  • Low-Level Instruction Features
  • Mostly simple binary operators
  • Result is often saved to the accumulator (A register)
  • Not intuitive to programmers
  • Intermediate instructions
  • 3-address codes (for register-based machines)
  • A = B + C
  • 2 source operands, one destination operand
  • Easy to map to machine instructions (share one
    source & destination operand)
  • A = A + B
  • Stack machine codes (for stack-based machines)
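To make the idea of 3-address codes concrete, the sketch below turns an assignment such as A = B + C * D into a list of instructions with at most one operator each. It is only an illustration: it borrows Python's own `ast` parser as a stand-in front end, and the function name `three_address_code` and the temporary names T1, T2, ... are assumptions.

```python
import ast

# Map Python AST operator classes to the operator text we emit.
_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def three_address_code(statement):
    """Translate one assignment statement into 3-address instructions."""
    assign = ast.parse(statement).body[0]
    if not isinstance(assign, ast.Assign):
        raise ValueError("expected a single assignment statement")
    code = []
    counter = 0

    def emit(node):
        # Returns the name that holds the value of `node`.
        nonlocal counter
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.BinOp):
            left = emit(node.left)
            right = emit(node.right)
            counter += 1
            temp = f"T{counter}"
            code.append(f"{temp} = {left} {_OPS[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    result = emit(assign.value)
    code.append(f"{assign.targets[0].id} = {result}")
    return code
```

For the running example, `three_address_code("A = B + C * D")` yields `T1 = C * D`, `T2 = B + T1`, `A = T2`: two source operands and one destination per instruction.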

14
Compiler: A Bridge Between PL and Hardware
Applications (High Level Language)
A = B + C * D
Compiler
Intermediate Codes:
T1 = C * D
T2 = B + T1
A = T2
Operating System
Assembly Codes:
MOV A, C
MUL A, D
ADD A, B
MOV va, A
Hardware (Low Level Language)
Register-based or Stack-based machines
15
Compiler (1) - Compilation
Source Program/Code (P.L., Formal Spec.): A = B + C * D
Compiler (reports Error Messages)
Intermediate Codes:
T1 = C * D
T2 = B + T1
A = T2
Target Program/Code (P.L., Assembly, Machine Code):
MOV A, C
MUL A, D
ADD A, B
MOV va, A
16
Compiler (2a): Execution
Running the compiled codes
Target code (compiled)
Loader (load into Real Machine)
Target Code (in Real Machine)
Input
Output
17
Compiler (2b): Compile & Go
Two working phases in two passes
Source Program
Compiler
Error Message
Target Code (in Real Machine)
Input
Output
  • Compiler: two independent phases to complete the
    work
  • (1) Compilation Phase: source-to-target
    compilation
  • (2) Execution Phase: run compiled codes,
    respond to input, and produce output

18
Compiler (2c): Compile & Go
Two working phases in two passes
Source program (=> executable Target code)
Compiler (+ Loader)
Input
Output
(target loaded into Real Machine)
  • Compiler: two independent phases to complete the
    work
  • (1) Compilation Phase: source-to-target
    compilation
  • (2) Execution Phase: run compiled codes,
    respond to input, and produce output

19
Interpreter (1)
Source program
Output
Interpreter
Input
Error Message
  • Interpreter: one single pass to complete the
    two-phase work
  • Each source statement is compiled and then
    executed
  • The next statement is then handled in the same
    way

20
Interpreter (2)
  • Compiles and then executes each incoming
    statement
  • Does not save compiled codes in executable files
  • Saves storage
  • Re-compiles the same statements when a loop jumps back
  • Slower
  • Detects (compilation + runtime) errors as they
    occur during execution time
  • vs. Compiler: detects syntax/semantic errors
    (compilation errors) during compilation time

21
Hybrid: Compiler + Interpreter?
Source program
Error Message
Compiler
Intermediate program
Interpreter
Output
Input
(with/without JIT)
22
Hybrid: Compiler + Interpreter?
Source program
  • Intermediate program
  • without syntax/semantic errors
  • machine independent
  • Interpreter
  • does not interpret high level source
  • but compiled low level code
  • easy to interpret & efficient

Compiler
Intermediate program
Interpreter
Output
Input
(with/without JIT)
23
Hybrid Method: Virtual Machine
Source program
Translator
(Compiler)
Intermediate program
Virtual Machine (VM)
Output
Input
(Interpreter with/without JIT)
24
Example: Java Compiler + Java VM
Java program
(app.java)
Java Compiler
(Javac)
(app.class)
Java Bytecodes
Java Virtual Machine
Output
Input
(Interpreter with/without JIT)
25
Hybrid Method: Virtual Machine
  • Compile source program into a platform
    independent code
  • E.g., Java => Bytecodes (stack-based
    instructions)
  • Execute the code with a virtual machine
  • High portability: the platform independent code
    can be distributed on the web, downloaded and
    executed on any platform that has a VM
    pre-installed
  • Good for cross-platform applications

26
Just-in-time (JIT) Compilation
  • Compile a new statement (only once) as it comes
    for the first time
  • And save the compiled codes
  • Executed by virtual/real machine
  • Does not re-compile when execution loops back
  • Example
  • Java VM (simple interpreter version, without
    JIT): high penalty in performance due to
    interpretation
  • Java VM + JIT: improved by a factor on the order
    of 10
  • JIT: translates bytecodes during run time to the
    native target machine instruction set
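The "compile only once and save the compiled codes" idea can be illustrated with a toy statement runner. This is an analogy rather than a real JIT: it caches Python code objects produced by the built-in `compile()` instead of native machine code, and the names `run_statement`, `run_loop` and `compile_count` are assumptions made for this sketch.

```python
compile_count = 0
_cache = {}   # statement text -> compiled code object


def run_statement(stmt, env, use_cache):
    """Compile (or reuse) one statement, then execute it in env."""
    global compile_count
    code = _cache.get(stmt) if use_cache else None
    if code is None:
        code = compile(stmt, "<stmt>", "exec")
        compile_count += 1
        if use_cache:
            _cache[stmt] = code   # saved, so a loop never recompiles it
    exec(code, env)


def run_loop(body, times, use_cache):
    """Run the statements in body `times` times; report result and compiles."""
    global compile_count
    compile_count = 0
    _cache.clear()
    env = {"total": 0}
    for _ in range(times):
        for stmt in body:
            run_statement(stmt, env, use_cache)
    return env["total"], compile_count
```

With a two-statement body looped five times, the simple-interpreter mode (`use_cache=False`) compiles ten times, while the caching mode compiles each statement once; both produce the same result.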

27
Comparison of Different Compilation-and-Go Schemes
  • Normal Compilers
  • Will generate codes for all statements whether
    they will be executed or not
  • Separate compilation and execution into two
    different phases
  • Syntax & semantic errors are detected at
    compilation time
  • Interpreters and JIT Compilers
  • Can generate codes only for statements that are
    really executed
  • Will depend on your input: different execution
    flows mean different sets of executed codes
  • Interpreter: syntax & semantic errors are
    detected at run/execution time
  • JIT vs. Simple Interpreter
  • JIT: saves the target machine codes
  • Can be re-used, and compiled at most once
  • Interpreter: does not save target machine codes
  • Compiled more than once

28
Register-Based Virtual Machine for Android Phone:
Dalvik VM
  • Java VM (JVM): Stack-based Instruction Set
  • Normally less efficient than RISC or CISC
    instructions
  • Limited memory organization
  • Requires too many swap and copy operations

29
Register-Based Virtual Machine for Android Phone:
Dalvik VM
  • Dalvik VM (for Android OS): Register-based
    Instruction Set
  • Smaller size
  • Better memory efficiency
  • Good for phone and other embedded systems
  • Generation and Execution of Dalvik byte codes
  • Compiled/Translated from Java byte code into a
    new byte code
  • app.java (Java source)
  • javac (Java Compiler) => app.class
    (executable by JVM)
  • dx (in Android SDK tool) => app.dex (Dalvik
    Executable)
  • compression => apps.apk (Android
    Application Package)
  • Dalvik VM => (execution)

30
How To Construct A Compiler
  • Language Processing Systems
  • High-Level and Intermediate Languages
  • Processing Phases
  • Quick Review on Syntax & Semantics
  • Processing Phases in Detail
  • Structure of Compilers

31
A Language-Processing System
Source Program
Modified Source Program
Compiler
Target Assembly Program
Assembler
Relocatable Machine Code
Target Machine Code
32
Programming Languages vs. Natural Languages
  • Natural languages: for communication between
    native speakers of the same or different
    languages
  • Chinese, English, French, Japanese
  • Programming languages: for communication between
    programmers and computers
  • Generic High-Level Programming Languages
  • Basic, Fortran, COBOL, Pascal, C/C++, Java
  • Typesetting Languages
  • TROFF (TBL, EQN, PIC), (La)TeX, PostScript
  • Markup Language -- Structured Documents
  • SGML, HTML, XML, ...
  • Script Languages
  • Csh, bsh, awk, perl, python, javascript, asp,
    jsp, php

35
Compiler with Intermediate Codes
Source Program/Code (P.L., Formal Spec.): A = B + C * D
Compiler (reports Error Messages)
Intermediate Codes:
T1 = C * D
T2 = B + T1
A = T2
Target Program/Code (P.L., Assembly, Machine Code):
MOV A, C
MUL A, D
ADD A, B
MOV va, A
36
Typical Phases of a Compiler
Input: float position, initial, rate; position = initial + rate * 60
Intermediate results passed between phases, in order:
Tokens
Parse Tree or Syntax Tree
Syntax Tree or Annotated Syntax Tree
3-address codes, or Stack machine codes
Optimized codes
Assembly (or Machine) Codes
37
Analysis-Synthesis Model of a Compiler
  • Analysis: Program => Constituents => I.R.
  • Lexical Analysis: linear => token
  • Syntax Analysis: hierarchical, nested => tree
  • Identify relations/actions among tokens, e.g.,
    add(b, mult(c,d))
  • Semantic Analysis: check legal constraints /
    meanings
  • By examining attributes associated with tokens &
    relations
  • Synthesis: I.R. => I.R. => Target Language
  • Intermediate Code Generation
  • generate intermediate representation (I.R.) from
    syntax
  • Code Optimization: generate better equivalent IR
  • machine independent & machine dependent
  • Code Generation

38
Typical Modules of a Compiler
Annotated Syntax Tree
40
How To Construct A Compiler
  • Language Processing Systems
  • High-Level and Intermediate Languages
  • Processing Phases
  • Quick Review on Syntax & Semantics
  • Processing Phases in Detail
  • Structure of Compilers

41
Syntax Analysis: Structure
  • Syntax Analysis (Parsing): match input tokens
    against a grammar of the language
  • To ensure that the input tokens form a legal
    sentence (statement)
  • To build the structure representation of the
    input tokens
  • So the structure can be used for translation (or
    code generation)
  • Knowledge source
  • Grammar in CFG (Context-Free Grammar) form
  • Additional semantic rules for semantic checks and
    translation (in later phases)

id1 = id2 + id3 * 60
Grammar
Syntax Analysis
S ? id e S ? e ? id t e ? t ? id n t
?
Parse Tree (Concrete syntax tree)
42
Grammar: Context Free Grammar
43
Context Free Grammar (CFG): Specification for
Structures & Constituency
  • Parse Tree: graphical representation of structure
  • root node (S): a sentential level structure
  • internal nodes: constituents of the sentence
  • arcs: relationship between parent nodes and their
    children (constituents)
  • terminal nodes: surface forms of the input
    symbols (e.g., words)
  • alternative representation: bracketed notation
  • e.g., I saw the girl in the park
  • Example

44
Parse Tree: I saw the girl in the park
45
CFG Components
  • CFG: formal specification of parse trees
  • G = {Σ, N, P, S}
  • Σ: terminal symbols
  • N: non-terminal symbols
  • P: production rules
  • S: start symbol
  • Σ: terminal symbols
  • the input symbols of the language
  • programming language: tokens (reserved words,
    variables, operators, ...)
  • natural languages: words or parts of speech
  • pre-terminal: parts of speech (when words are
    regarded as terminals)
  • N: non-terminal symbols
  • groups of terminals and/or other non-terminals
  • S: start symbol, the largest constituent of a
    parse tree
  • P: production (re-writing) rules
  • form: α → β (α: a non-terminal, β: a string of
    terminals and non-terminals)
  • meaning: α re-writes to (consists of, derives
    into) β, or β reduces to α
  • start with S-productions (S → β)

46
CFG Example Grammar
  • Grammar Rules
  • S → NP VP
  • NP → Pron | Proper-Noun | Det Norm
  • Norm → Noun Norm | Noun
  • VP → Verb | Verb NP | Verb NP PP | Verb PP
  • PP → Prep NP
  • S: sentence, NP: noun phrase, VP: verb phrase
  • Pron: pronoun
  • Det: determiner, Norm: nominal
  • PP: prepositional phrase, Prep: preposition
  • Lexicon (in CFG form)
  • Noun → girl | park | desk
  • Verb → like | want | is | saw | walk
  • Prep → by | in | with | for
  • Det → the | a | this | these
  • Pron → I | you | he | she | him
  • Proper-Noun → IBM | Microsoft | Berkeley
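The example grammar is small enough to run. The sketch below encodes the rules and lexicon as Python dictionaries and enumerates parses with a naive backtracking top-down parser, printing trees in bracketed notation. It is an illustration only: the function names (`parse`, `expand`, `bracketed`, `parse_sentence`) are assumptions, and the approach works here because the grammar has no left recursion.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Pron"], ["Proper-Noun"], ["Det", "Norm"]],
    "Norm": [["Noun", "Norm"], ["Noun"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "NP", "PP"], ["Verb", "PP"]],
    "PP": [["Prep", "NP"]],
}
LEXICON = {
    "Noun": {"girl", "park", "desk"},
    "Verb": {"like", "want", "is", "saw", "walk"},
    "Prep": {"by", "in", "with", "for"},
    "Det": {"the", "a", "this", "these"},
    "Pron": {"I", "you", "he", "she", "him"},
    "Proper-Noun": {"IBM", "Microsoft", "Berkeley"},
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way symbol matches tokens[pos:]."""
    if symbol in LEXICON:                       # pre-terminal: match one word
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            yield (symbol, tokens[pos]), pos + 1
        return
    for rhs in GRAMMAR[symbol]:                 # try each production in turn
        yield from expand(symbol, rhs, tokens, pos, [])


def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining right-hand-side symbols left to right."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, nxt in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, nxt, children + [child])


def bracketed(tree):
    """Render a tree in bracketed notation, e.g. [NP [Pron I]]."""
    label, *rest = tree
    if len(rest) == 1 and isinstance(rest[0], str):
        return f"[{label} {rest[0]}]"
    return f"[{label} " + " ".join(bracketed(c) for c in rest) + "]"


def parse_sentence(sentence):
    """Return all complete parses of the sentence as bracketed strings."""
    tokens = sentence.split()
    return [bracketed(tree)
            for tree, end in parse("S", tokens, 0) if end == len(tokens)]
```

Under this grammar, `parse_sentence("I saw the girl in the park")` returns a single parse in which the PP attaches to the VP, because the rules contain Verb NP PP but no rule that attaches a PP inside an NP.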

47
Syntax vs. Semantic Analyses
  • Syntax
  • What do the input tokens look like? Do they form a
    legal structure?
  • Analysis of relationship between elements
  • e.g., operator-operands relationship
  • Semantic
  • What do they mean? And, thus, how do they act?
  • Analysis of detailed attributes of elements and
    check constraints over them under the given
    syntax
  • Not all knowledge between elements can be
    conveniently represented by a simple syntactic
    structure. Various kinds of attributes are
    associated with sub-structures in the given syntax

48
Syntax vs. Semantic Analyses
  • Examples
  • int a, b, c, d; float f; char s1, s2;
  • a = b + c * d;
  • a = b + f * d; // OK, but not strictly
    right
  • a = b + s1 * s2; // BAD: * is undefined for
    strings
  • a = b + s1 * 3; // OK? if properly defined
  • All the above statements have the same look
  • Convenient to represent them with the same
    syntactic structure (grammar/production rules)
  • But Semantically
  • Not all of them are meaningful (?? string *
    string ??)
  • You have to check their other attributes for
    meanings
  • Not all meaningful statements will mean/act the
    same and have the same codes (int * int ≠ int *
    float ≠ string * int)
  • You have to generate different codes according to
    other attributes of the tokens, since
    instructions are limited
  • E.g., INT and FLOAT additions may use different
    machine instructions, like ADD and ADDF
    respectively.
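A semantic check of this kind can be sketched as a lookup of operand types followed by instruction selection. The sketch is a simplification under stated assumptions: the table `SYMBOLS` treats s1 and s2 as strings (as the slide's comments imply), only + and * are handled, and MULF is a hypothetical float counterpart named by analogy with the ADD/ADDF pair mentioned above.

```python
# Declared types of the names used in the examples (s1, s2 assumed to be strings).
SYMBOLS = {"a": "int", "b": "int", "c": "int", "d": "int",
           "f": "float", "s1": "string", "s2": "string"}

NUMERIC = {"int", "float"}
BASE_OPCODE = {"+": "ADD", "*": "MUL"}


def result_type(op, left_type, right_type):
    """Type of `left op right`, or TypeError if the operation is undefined."""
    if left_type in NUMERIC and right_type in NUMERIC:
        return "float" if "float" in (left_type, right_type) else "int"
    raise TypeError(f"'{op}' is undefined for {left_type} and {right_type}")


def select_instruction(op, left_name, right_name):
    """Pick the integer or float variant of the instruction for op."""
    kind = result_type(op, SYMBOLS[left_name], SYMBOLS[right_name])
    return BASE_OPCODE[op] + ("F" if kind == "float" else "")
```

The three statements share one syntactic shape, yet `select_instruction("*", "c", "d")` and `select_instruction("*", "f", "d")` choose different instructions, and `select_instruction("*", "s1", "s2")` is rejected with a `TypeError`.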

semantic analyzer
49
Semantic Analysis: Attributes
Parse Tree (Concrete Syntax Tree)
Semantic Analysis (semantic checks & abstraction, using
Semantic Rules Assoc. with Grammar Productions)
Syntax Tree (Abstract Syntax Tree) for id1 = id2 + id3 * 60
50
How To Construct A Compiler
  • Language Processing Systems
  • High-Level and Intermediate Languages
  • Processing Phases
  • Quick Review on Syntax & Semantics
  • Processing Phases in Detail
  • Structure of Compilers

51
Symbol Table Management
  • Symbols
  • Variable names, procedure names, constant
    literals (3.14159)
  • Symbol Table
  • A record for each name describing its attributes
  • Managing Information about names
  • Variable attributes
  • Type, register/storage allocated, scope
  • Procedure names
  • Number and types of arguments
  • Method of argument passing
  • By value, address, reference
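A symbol table along these lines can be sketched as a stack of dictionaries, one per scope, with one record of attributes per name. The class name `SymbolTable`, its method names, and the attribute keys used in the usage note are assumptions made for illustration; production compilers typically use more elaborate structures.

```python
class SymbolTable:
    """One record of attributes per name, organized as a stack of scopes."""

    def __init__(self):
        self.scopes = [{}]            # the outermost (global) scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, **attributes):
        """Add a record for name in the current (innermost) scope."""
        current = self.scopes[-1]
        if name in current:
            raise KeyError(f"'{name}' is already declared in this scope")
        current[name] = attributes

    def lookup(self, name):
        """Search from the innermost scope outwards; None if undeclared."""
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None
```

For example, after `table.declare("rate", type="float", storage="R1")` in the global scope, a nested scope may declare its own `rate`; `lookup` then returns the inner record until `exit_scope()` is called.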

52
1 Lexical Analysis: Tokenization
I saw the girls / I see the girls
final = initial + rate * 60 / f = i + r * 60
Both look the same. So you want to represent
them with the same normalized token string, and
hide detailed features as additional attributes.
Lexical Analysis
I(1psg) see (ed) the girl (s) / I(1psg) see
(prs) the girl (s)
id1 = id2 + id3 * 60
Symbol table (natural language example):
1 | I | I | 1psg
2 | see | saw | ed
3 | the | the |
4 | girl | girls | 3ppl, s
Symbol table (program example):
1 | id1 | final | float | R2
2 | id2 | initial | float | R1
3 | id3 | rate | float |
4 | const1 | 60 | const | 60.0
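The normalization described on this slide can be sketched with Python's `re` module: identifiers are replaced by id1, id2, ... in order of first appearance, while the original spellings are kept in a table. The pattern `TOKEN_RE`, the function name `tokenize`, and the choice of operators are assumptions for this sketch; it is not a LEX-generated scanner.

```python
import re

# One token per match: an identifier, a number, or a single-character operator.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+(?:\.\d+)?)|(?P<op>[=+\-*/()]))"
)


def tokenize(source):
    """Return (normalized tokens, table mapping each name to its id number)."""
    source = source.rstrip()
    names = {}
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = match.end()
        if match.group("id"):
            # The same name always maps to the same idN token.
            index = names.setdefault(match.group("id"), len(names) + 1)
            tokens.append(f"id{index}")
        elif match.group("num"):
            tokens.append(match.group("num"))
        else:
            tokens.append(match.group("op"))
    return tokens, names
```

Both `final = initial + rate * 60` and `f = i + r * 60` produce the same token string `id1 = id2 + id3 * 60`; only the name tables differ, which is the point of hiding details as attributes.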
53
2 Syntax Analysis: Structure
I see (ed) the girl (s)
id1 = id2 + id3 * 60
Grammar
Syntax Analysis
Normalized tokens have the same parse/syntax
tree whether they were see/saw or
girl/girls.
Parse Tree (Concrete syntax tree)
54
Syntax Analysis: Structure
I see (ed) the girl (s)
id1 = id2 + id3 * 60
Syntax Analysis
Sentence
NP
verb
NP
I see (ed) the girl (s)
Syntax Tree (Abstract syntax tree)
55
Semantic Analysis: Attributes
Syntax Tree (Abstract Syntax Tree)
Semantic Analysis
Syntax Tree (Abstract Syntax Tree)
56
3 Semantic Analysis: Attributes
Semantic checks & abstraction
Parse Tree (Concrete Syntax Tree)
Semantic Rules Assoc. with Grammar Productions
Semantic Analysis
Sentence
Syntax Tree (Abstract Syntax Tree)
NP.subject
verb
NP.object
I see (ed) the girl (s)
61
Semantic Checking
  • Semantic Constraints
  • Agreement (somewhat syntactic)
  • Subject-Verb: I have, she has/had, I do have, she
    does not
  • NP: Quantifier-noun: a book, two books
  • Selectional Constraint
  • Kill → Animate
  • Kiss → Animate

semantic checking
See+ed(I, the girls)
(semantically meaningful)
Kill/Kiss (John, the Stone)
(semantically meaningless unless the Stone
refers to an animate entity)
62
Parse Tree vs. Syntax Tree
  • Parse Tree (aka concrete syntax tree)
  • Concrete tree representation drawn according to a
    grammar
  • For validating correctness of syntax of input
  • For easy parsing (or fitting constraints of
    parsing algorithm)
  • Normally constructed incrementally during parsing
  • Syntax Tree (aka abstract syntax tree)
  • Logical tree representation that characterizes the
    abstract relationships between constituents
  • For representing semantic relationships &
    semantic checking
  • Normalizing various parse trees of the same
    meaning (semantics)
  • May ignore non-essential syntactic details
  • Not always the same as parse tree
  • May be constructed in parallel with the parse
    tree during parsing
  • Or converted from parse tree after syntactic
    parsing
  • Annotated Syntax Tree (AST)
  • Syntax Tree with annotated attributes

63
Parse Tree vs. Syntax Tree
Parse Tree for G1
  • Parse Tree (depends on grammar)
  • Input: T + T + T
  • G1: T ((+ T) (+ T))
  • E → T R
  • R → + T R
  • R → <null>
  • G2: ((T) + T) + T
  • E → E + T
  • E → T
  • Syntax Tree
  • Abstract representation for syntax defined by
    G1/G2
  • Use operation as parent nodes and operands as
    children nodes
  • Operation-operand relationship Easy for
    instruction selection in code generation (e.g.,
    ADD R1, R2)

Parse Tree for G2
Syntax Tree (independent of G1 or G2)
64
4 Intermediate Code Generation
Attribute evaluation (assembly codes are
attributes for code generation)
Action(anim,anim)
see (ed)
anim
anim
subject
object
I the girl (s)
Intermediate Code Generation
logic form: See+ed(I, the girls)
3-address codes:
temp1 = i2r(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
65
4 Intermediate Code Generation
Attribute evaluation (assembly codes are
attributes for code generation)
Action(anim,anim)
anim
anim
Action
Intermediate Code Generation
logic form: See+ed(I, the girls)
3-address codes:
temp1 = i2r(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
66
Syntax-Directed Translation (1)
  • Translation from input to target can be regarded
    as attribute evaluation.
  • Evaluate attributes of each node, in a well
    defined order, based on the particular piece of
    sub-tree structure (syntax) wherein the
    attributes are to be evaluated.
  • Attributes: the particular properties associated
    with a tree node (a node may have many
    attributes)
  • Abstract representation of the sub-tree rooted at
    that node
  • The attributes of the root node represent the
    particular properties of the whole input
    statement or sentence.
  • E.g., value associated with a mathematical
    sub-expression
  • E.g., machine codes associated with a
    sub-expression
  • E.g., language translation associated with a
    sub-sentence

67
Syntax-Directed Translation (2)
  • Synthesized Attributes
  • Attributes that can be evaluated based on the
    attributes of children nodes
  • E.g., value of math. expression can be acquired
    from the values of sub-expressions (and the
    operators being applied)
  • a = b + c * d
  • (=> a.val = b.val + tmp.val, where tmp.val = c.val
    * d.val)
  • girls = girl + s
  • (=> tr.girls = tr.girl + tr.s)
  • Inherited Attributes
  • Attributes evaluatable from parent and/or sibling
    nodes
  • E.g., data type of a variable can be acquired
    from its left-hand side type declaration or from
    the type of its left-hand side sibling
  • int a, b, c (=> a.type = INT, b.type = a.type,
    ...)

68
Syntax-Directed Translation (3)
  • Attribute evaluation order
  • Any order that evaluates an attribute AFTER
    all the attributes it depends on are evaluated
    will result in correct evaluation.
  • General: topological order
  • Analyze the dependency between attributes and
    construct an attribute tree or forest
  • Evaluate the attribute of any leaf node, and
    mark it as evaluated, thus logically removing it
    from the attribute tree or forest
  • Repeat for any leaf nodes that have not been
    marked, until there is no unmarked node
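The evaluation order described above can be sketched with the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later). The attribute names follow the a = b + c * d example from the previous slide; the concrete leaf values and the `RULES` table layout are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# attribute -> (attributes it depends on, rule computing it from them)
RULES = {
    "b.val": ((), lambda: 2),
    "c.val": ((), lambda: 3),
    "d.val": ((), lambda: 4),
    "tmp.val": (("c.val", "d.val"), lambda c, d: c * d),
    "a.val": (("b.val", "tmp.val"), lambda b, tmp: b + tmp),
}


def evaluate(rules):
    """Evaluate every attribute after all attributes it depends on."""
    graph = {attr: set(deps) for attr, (deps, _) in rules.items()}
    values = {}
    # static_order() yields each node only after all of its predecessors.
    for attr in TopologicalSorter(graph).static_order():
        deps, rule = rules[attr]
        values[attr] = rule(*(values[dep] for dep in deps))
    return values
```

Here `evaluate(RULES)["a.val"]` is 2 + 3 * 4; any other order that respects the same dependencies would give the same values. A cyclic dependency makes `static_order()` raise `graphlib.CycleError`, which matches the requirement that the dependencies form a tree or forest.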

69
5 Code Optimization: Normalization
Normalization into better equivalent form
(optional)
Before:
temp1 = i2r(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
Was_Killed(Bill, John)
See+ed(I, the girls)
Unify passive/active voices
Code Optimization
After:
Killed(John, Bill)
See+ed(I, the girls)
temp1 = id3 * 60.0
id1 = id2 + temp1
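The shortening of the 3-address code shown on this slide can be sketched as two simple rewrites: folding the integer-to-real conversion of a constant at compile time, and retargeting the instruction that feeds a final copy. This is a toy pass under stated assumptions: the tuple layout `(dest, op, arg1, arg2)`, the `"copy"` op name and the function name `optimize` are invented here, and the copy rewrite is only safe when the copied temporary is not used again afterwards.

```python
def optimize(code):
    """Fold i2r of constants and remove a copy of the preceding result."""
    folded = {}   # temp name -> constant text computed at compile time
    out = []
    for dest, op, arg1, arg2 in code:
        arg1 = folded.get(arg1, arg1)
        arg2 = folded.get(arg2, arg2)
        if op == "i2r":
            # Convert the integer constant now instead of at run time.
            folded[dest] = str(float(arg1))
            continue
        if op == "copy" and out and out[-1][0] == arg1:
            # The previous instruction computed the temp being copied:
            # let it write to the final destination directly.
            _, prev_op, prev_a, prev_b = out.pop()
            out.append((dest, prev_op, prev_a, prev_b))
            continue
        out.append((dest, op, arg1, arg2))
    return out


CODE = [
    ("temp1", "i2r", "60", None),
    ("temp2", "*", "id3", "temp1"),
    ("temp3", "+", "id2", "temp2"),
    ("id1", "copy", "temp3", None),
]
```

`optimize(CODE)` leaves two instructions, a multiplication by 60.0 and an addition into id1, which is the same shape as the optimized code on the slide apart from the numbering of the temporary.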
70
6 Code Generation
temp1 = id3 * 60.0
id1 = id2 + temp1
See+ed(I, the girls)
Selection of target words & order of phrases
Code Generation
Selection of usable codes & order of codes
Allocation of available registers
movf id3, r2
mulf 60.0, r2
movf id2, r1
addf r2, r1
movf r1, id1
71
Objectives of Optimizing Compilers
  • Correct codes preserve meaning
  • Better performance
  • Maximum Execution Efficiency
  • Minimum Code Size
  • Embedded systems
  • Minimizing Power Consumption
  • Mobile devices
  • Typically, faster execution also implies lower
    power
  • Reasonable compilation time
  • Manageable engineering and maintenance efforts

72
Optimization for Computer Architectures (1)
  • Parallelism
  • Instruction level: multiple operations are
    executed simultaneously
  • Processor: checks dependency in sequential
    instructions, issues them in parallel
  • Hardware scheduler: changes order of instructions
  • Compilers: rearrange instructions to make
    instruction level parallelism more effective
  • Instruction set supports
  • Very long instruction word: issues multiple
    operations in parallel
  • Instructions that can operate on vector data at
    the same time
  • Compilers: generate codes for such machines from
    sequential codes
  • Processor level: different threads of the same
    application are run on different processors
  • Multiprocessors & multithreaded codes
  • Programmer: writes multithreaded codes, vs.
  • Compiler: generates parallel codes automatically

73
Optimization for Computer Architectures (2)
  • Memory Hierarchies
  • No storage that is both fast and large
  • Registers (tens to hundreds of bytes), caches (KB
    to MB), main/physical memory (MB to GB),
    secondary/virtual memory (hard disks) (GB to TB)
  • Using registers effectively is probably the
    single most important problem in optimizing a
    program
  • Cache-management by hardware is not effective in
    scientific code that has large data structures
    (arrays)
  • Improve effectiveness of memory hierarchies
  • By changing layout of data, or
  • Changing the order of instructions accessing the
    data
  • Improve effectiveness of instruction cache
  • Change the layout of codes

74
How To Construct A Compiler
  • Language Processing Systems
  • High-Level and Intermediate Languages
  • Processing Phases
  • Quick Review on Syntax & Semantics
  • Processing Phases in Detail
  • Structure of Compilers

75
Structure of a Compiler
  • Front End: Source Dependent
  • Lexical Analysis
  • Syntax Analysis
  • Semantic Analysis
  • Intermediate Code Generation
  • (Code Optimization: machine independent)
  • Back End: Target Dependent
  • Code Optimization
  • Target Code Generation

76
Structure of a Compiler
Front ends (source languages): Fortran, Pascal, C
Intermediate Code
Back ends (target machines): MIPS, SPARC, Pentium
77
History
  • 1st Fortran compiler: 1950s
  • efficient? (compared with assembly program)
  • not bad, but much easier to write programs
  • high-level languages are feasible.
  • 18 man-years, ad hoc structure
  • Today, we can build a simple compiler in a few
    months.
  • Crafting an efficient and reliable compiler is
    still challenging.

78
Cousins of the Compiler
  • Preprocessors: macro definition/expansion
  • Interpreters
  • Compiler vs. interpreter vs. just-in-time
    compilation
  • Assemblers: 1-pass / 2-pass
  • Linkers: link compiled code with library functions
  • Loaders: load executables into memory
  • Editors: editing sources (with/without syntax
    prediction)
  • Debuggers: symbolically providing stepwise trace
  • Profilers: gprof (call graph and time analysis)
  • Project managers: IDE
  • Integrated Development Environment
  • Disassemblers, Decompilers: low-level to
    high-level language conversion

79
Applications of Compilation Techniques
80
Applications of Compilation Techniques
  • Virtually any kind of programming language or
    specification language with regular and
    well-defined grammatical structures will need
    some kind of compiler (or a variant, or a part
    of one) to analyze and then process it.

81
Applications of Lexical Analysis
  • Text/Pattern Processing
  • grep: get lines with specified pattern
  • Ex: grep From /var/spool/mail/andy
  • sed: stream editor, editing specified patterns
  • Ex: ls *.JPG | sed s/JPG/jpg/
  • tr: simple translation between patterns (e.g.,
    uppercases to lowercases)
  • Ex: tr a-z A-Z < mytext > mytext.uc
  • AWK: pattern-action rule processing
  • pattern processing based on regular expression
  • Ex: awk '$1=="John"{count++}END{print count}' <
    Students.txt

82
Applications of Lexical Analysis
  • Search Engines/Information Retrieval
  • full text search, keyword matching, fuzzy match
  • Database Machine
  • fast matching over large database
  • database filter
  • Fast Multiple Matching Algorithms
  • Optimized/specialized lexical analyzers (FSA)
  • Examples: KMP, Boyer-Moore (BM), ...

83
Applications of Syntax Analysis
  • Structured Editor/Word Processor
  • Integrated Development Environment (IDE)
  • automatic formatting, keyword insertion
  • Incremental Parser vs. Full-blown Parsing
  • incremental: patching analysis made by
    incremental changes, instead of re-parsing or
    re-compiling
  • Pretty Printer: beautify nested structures
  • cb (C-beautifier)
  • indent (an even more versatile C-beautifier)

84
Applications of Syntax Analysis
  • Static Checker/Debugger: lint
  • check errors without really running, e.g.,
  • statement not reachable
  • used before defined

85
Application of Optimization Techniques
  • Data flow analysis
  • Software testing
  • Locating errors before running (static checking)
  • Locate errors along all possible execution paths
  • not only on test data set
  • Type Checking
  • Dereferencing null or freed pointers
  • Dangerous user supplied strings
  • Bound Checking
  • Security vulnerability: buffer over-run attack
  • Tracking values of pointers across procedures
  • Memory management
  • Garbage collection

86
Applications of Compilation Techniques
  • Pre-processor: macro definition/expansion
  • Active Webpages Processing
  • Script or programming languages embedded in
    webpages for interactive transactions
  • Examples: JavaScript, JSP, ASP, PHP
  • Compiler Apps: expansion of embedded statements,
    in addition to web page parsing
  • Database Query Language: SQL

87
Applications of Compilation Techniques
  • Interpreter
  • no pre-compilation
  • executed on-the-fly
  • e.g., BASIC
  • Script Languages: C-shell, Perl
  • Function: batch processing of multiple
    files/databases
  • mostly interpreted, some pre-compiled
  • Some are interpreted and save the compiled codes

88
Applications of Compilation Techniques
  • Text Formatter
  • Troff, LaTeX, Eqn, Pic, Tbl
  • VLSI Design: Silicon Compiler
  • Hardware Description Languages
  • variables => control signals / data
  • Circuit Synthesis
  • Preliminary Circuit Simulation by Software

89
Applications of Compilation Techniques
  • VLSI Design

90
Advanced Applications
  • Natural Language Processing
  • advanced search engines retrieve relevant
    documents
  • more than keyword matching
  • natural language query
  • information extraction
  • acquire relevant information (into structured
    form)
  • text summarization
  • get most brief relevant paragraphs
  • text/web mining
  • mining information rules from text/web

91
Advanced Applications
  • Machine Translation
  • Translating a natural language into another
  • Models
  • Direct translation
  • Transfer-Based Model
  • Inter-lingua Model
  • Transfer-Based Model
  • Analysis-Transfer-Generation (or Synthesis) model

92
Tools for Compiler Construction
93
Tools: Automatic Generation of Lexical Analyzers
and Compilers
  • Lexical Analyzer Generator: LEX
  • Input: token pattern specification (in regular
    expressions)
  • Output: a lexical analyzer
  • Parser Generator: YACC
  • compiler-compiler
  • Input: grammar specification (in context-free
    grammar)
  • Output: a syntax analyzer (aka parser)
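
To make the idea of "token patterns in, lexical analyzer out" concrete, here is a minimal Python sketch of a hand-written scanner driven by regular expressions. It is not LEX itself; the token names and patterns in `TOKEN_SPEC` are hypothetical choices made for illustration.

```python
import re

# Hypothetical token patterns, written as regular expressions
# (the kind of specification a LEX-style tool takes as input).
TOKEN_SPEC = [
    ("NUM",    r"\d+(?:\.\d+)?"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("OP",     r"[+\-*/()]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token_type, lexeme) pairs; raise on illegal input."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":          # whitespace is discarded
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling `tokenize("rate * 60")` yields an identifier token, an operator token and a number token; a generated scanner would do the same work with a compiled finite automaton instead of Python's `re` engine.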

94
Tools
  • Syntax Directed Translation engines
  • translations associated with nodes
  • translations defined in terms of translations of
    children
  • Automatic code generation
  • translation rules
  • template matching
  • Data flow analyses
  • dependency of variables and constructs
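
As a rough illustration of "translations defined in terms of translations of children", the sketch below translates an expression tree into postfix notation. The node classes `Num` and `BinOp` and the function `to_postfix` are hypothetical names chosen for this example, not part of any tool named on the slide.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Num:
    value: str

@dataclass
class BinOp:
    op: str
    left: "Node"
    right: "Node"

Node = Union[Num, BinOp]

def to_postfix(node: Node) -> str:
    """The translation of a node is built from the translations of its children."""
    if isinstance(node, Num):
        return node.value
    return f"{to_postfix(node.left)} {to_postfix(node.right)} {node.op}"

# Tree for the expression (9 + 5) - 2
tree = BinOp("-", BinOp("+", Num("9"), Num("5")), Num("2"))
print(to_postfix(tree))  # prints: 9 5 + 2 -
```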

95
Programming Languages
  • Issues about Modern PLs
  • Module programming: parameter passing
  • Nested modules: scopes
  • Static vs. dynamic allocation

96
Programming Language Basics
  • Static vs. Dynamic Issues or Policies
  • Static: determined at compile time
  • Dynamic: determined at run time
  • Scope of a declaration
  • Region in which a use of x refers to a
    declaration of x
  • Static Scope (aka lexical scope)
  • Possible to determine the scope of a declaration by
    looking only at the program text
  • C, Java (and most PLs)
  • Delimited by block structures
  • Dynamic scope
  • At run time, the same use of x could refer to any
    of several declarations of x.
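
The difference can be sketched in Python. Python itself is statically scoped, so the dynamic-scope half is only an emulation using an explicit stack of binding environments; `env_stack`, `lookup` and the function names are hypothetical.

```python
x = "global x"

def show():
    # Static (lexical) scope: x is resolved from the program text,
    # so this always refers to the module-level x.
    return x

def caller():
    x = "caller's local x"   # has no effect on show() under static scoping
    return show()

# Dynamic scope, emulated: the most recent active binding wins.
env_stack = [{"x": "global x"}]

def lookup(name):
    for env in reversed(env_stack):
        if name in env:
            return env[name]
    raise NameError(name)

def show_dynamic():
    return lookup("x")

def caller_dynamic():
    env_stack.append({"x": "caller's local x"})   # binding active during the call
    try:
        return show_dynamic()
    finally:
        env_stack.pop()
```

Here `caller()` returns the global binding, while `caller_dynamic()` returns the caller's binding: the same use of x resolves to different declarations depending on who is calling.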

97
Programming Language Basics
  • Variable declaration
  • Static variables
  • Possible to determine the location in memory
    where the declared variable can be found
  • public static int x; // Java or C# syntax
  • Only one copy of x, can be determined at compile
    time
  • Global declarations and declared constants can
    also be made static
  • Dynamic variables
  • Variables declared without the static keyword
  • Each object of the class would have its own
    location where x would be held.
  • At run time, the same use of x in different
    objects could refer to any of several different
    locations.

98
Programming Language Basics
  • Parameter Passing Mechanisms
  • call by value
  • make a copy of the actual parameter's value
  • call by reference
  • pass a copy of the address of the actual parameter
  • call by name (Algol 60)
  • callee executed as if the actual parameter were
    substituted literally for the formal parameter
    in the code of the callee
  • macro expansion of formal parameter into actual
    parameter
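
The three mechanisms can be imitated in Python as a rough sketch. Python's own parameter passing shares object references, so call by value is emulated here with an explicit copy and call by name with a zero-argument function (a thunk) that is re-evaluated at every use. All function names are hypothetical.

```python
import copy

def by_value(a):
    a = copy.deepcopy(a)   # the callee works on its own copy
    a.append(99)
    return a

def by_reference(a):
    a.append(99)           # the change is visible to the caller

def by_name(thunk):
    # The actual parameter expression is evaluated again at each use,
    # as if it had been substituted textually into the body.
    return thunk() + thunk()

def make_counter():
    count = 0
    def next_value():
        nonlocal count
        count += 1
        return count
    return next_value
```

With `data = [1, 2]`, `by_value(data)` leaves `data` unchanged while `by_reference(data)` appends to it. `by_name(make_counter())` evaluates the argument twice and returns 1 + 2, whereas passing the already-evaluated value 1 would give 2.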

99
Formal Languages
100
Languages, Grammars and Recognition Machines
[Figure: a grammar (expression) defines and generates a language; a parser (automaton), constructed from the grammar via a parsing table, accepts the language. Example sentence: I saw a girl in the park. Example rules: S → NP VP; NP → pron | det n]
101
Languages
  • Alphabet - any finite set of symbols
    {0, 1}: the binary alphabet
  • String - a finite sequence of symbols from an
    alphabet
    1011: a string of length 4; ε: the empty string
  • Language - any set of strings over an alphabet
    {00, 01, 10, 11}: the set of binary strings of
    length 2; ∅: the empty set

102
Terms for Parts of a String
  • string: banana
  • (proper) prefix: ε, b, ba, ban, ..., banana
  • (proper) suffix: ε, a, na, ana, ..., banana
  • (proper) substring: ε, b, a, n, ba, an, na,
    ..., banana
  • subsequence (including non-consecutive ones):
    ε, b, a, n, ba, bn, an, aa, na, nn, ..., banana
  • sentence: a string in the language
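
These terms can be checked mechanically. The following Python sketch enumerates each kind of part for a given string; the helper names are hypothetical, and the empty string is written as `""`.

```python
from itertools import combinations

def prefixes(s):
    """All prefixes, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes, from the empty string up to s itself."""
    return [s[i:] for i in range(len(s), -1, -1)]

def substrings(s):
    """All runs of consecutive symbols (including the empty string)."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}

def subsequences(s):
    """Symbols kept in order, but not necessarily consecutive."""
    return {"".join(c) for r in range(len(s) + 1) for c in combinations(s, r)}
```

For `banana`, the string `bn` is a subsequence but not a substring, which is exactly the distinction the list above draws.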

103
Operations on Strings
  • concatenation: x = dog, y = house, xy = doghouse
  • exponentiation: s^0 = ε, s^1 = s, s^2 = ss

104
Operations on Languages
  • Union of L and M: L ∪ M = { s | s ∈ L or
    s ∈ M }
  • Concatenation of L and M: LM = { st | s ∈ L
    and t ∈ M }
  • Kleene closure of L: L* = L^0 ∪ L^1 ∪ L^2 ∪ ...
  • Positive closure of L: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ...
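
For finite languages these operations map directly onto Python sets. The closures are infinite in general, so this sketch only builds them up to a caller-supplied bound `max_power`, which is an assumption of the example rather than part of the definitions.

```python
def union(L, M):
    return L | M

def concat(L, M):
    return {s + t for s in L for t in M}

def power(L, i):
    """L^i: i-fold concatenation of L with itself; L^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_closure(L, max_power):
    """Bounded approximation of L*: union of L^0 .. L^max_power."""
    out = set()
    for i in range(max_power + 1):
        out |= power(L, i)
    return out

def positive_closure(L, max_power):
    """Bounded approximation of L+: union of L^1 .. L^max_power."""
    out = set()
    for i in range(1, max_power + 1):
        out |= power(L, i)
    return out
```

Note that the Kleene closure always contains the empty string, while the positive closure contains it only if L does.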

105
Grammars
  • The sentences in a language may be defined by a
    set of rules called a grammar: L = {00, 01, 10,
    11}
  • (the set of binary strings of
    length 2)
  • G = (0|1)(0|1)
  • Languages of different degrees of regularity can
    be specified with grammars of different
    expressive power
  • Chomsky Hierarchy
  • Regular Grammar < Context-Free Grammar <
    Context-Sensitive Grammar < Unrestricted

106
Automata
  • An acceptor/recognizer of a language is an
    automaton which determines if an input string is
    a sentence in the language
  • A transducer of a language is an automaton which
    determines if an input string is a sentence in
    the language, and may produce strings as output
    if it is in the language
  • Implementation: state transition functions
    (parsing table)
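
A table-driven automaton of this kind can be sketched in a few lines of Python. The example accepts binary strings of length 2 and, acting as a transducer, outputs the bitwise complement of an accepted input; the state names and the choice of output are hypothetical.

```python
# Transition table: (state, input symbol) -> (next state, output string)
TABLE = {
    ("q0", "0"): ("q1", "1"),
    ("q0", "1"): ("q1", "0"),
    ("q1", "0"): ("q2", "1"),
    ("q1", "1"): ("q2", "0"),
}
START, ACCEPTING = "q0", {"q2"}

def transduce(s):
    """Return the output string if s is in the language, else None."""
    state, out = START, []
    for ch in s:
        if (state, ch) not in TABLE:
            return None                  # no transition: reject
        state, piece = TABLE[(state, ch)]
        out.append(piece)
    return "".join(out) if state in ACCEPTING else None

def accepts(s):
    """Acceptor view: only the yes/no answer is kept."""
    return transduce(s) is not None
```

The acceptor and the transducer share the same table; the transducer simply keeps the output collected along the accepting path.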

107
Transducer
[Figure: grammars G1 and G2 define/generate languages L1 and L2. An automaton constructed from the grammars accepts L1 and translates it into L2.]
108
Meta-languages
  • Meta-language: a language used to define another
    language. Different meta-languages will be
    used to define the various components of
    a programming language so that these
    components can be analyzed automatically.

109
Definition of Programming Languages
  • Lexical tokens: regular expressions
  • Syntax: context-free grammars
  • Semantics: attribute grammars
  • Intermediate code generation: attribute
    grammars
  • Code generation: tree grammars

110
Implementation of Programming Languages
  • Regular expressions: finite automata, lexical
    analyzer
  • Context-free grammars: pushdown automata,
    parser
  • Attribute grammars: attribute evaluators, type
    checker and intermediate code generator
  • Tree grammars: finite tree automata, code
    generator

111
Appendix: Machine Translation
112
Machine Translation (Transfer Approach)
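[Figure: SL Text → Analysis → SL IR → Transfer → TL IR → Synthesis → TL Text. Analysis uses SL dictionaries and grammar; Transfer uses SL-TL dictionaries and transfer rules; Synthesis uses TL dictionaries and grammar. SL: source language; TL: target language; IR: Intermediate Representation]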
  • Analysis is target independent, and
  • Generation (Synthesis) is source independent

113
Example: Miss Smith put two books on this dining
table.
  • Analysis
  • Morphological and Lexical Analysis
  • Part-of-speech (POS) Tagging
  • n. Miss / n. Smith / v. put (ed) / q. two /
    n. book (s) / p. on / d. this / n. dining table

114
Example: Miss Smith put two books on this dining
table.
  • Syntax Analysis
  • [S [NP Miss Smith] [VP [V put(ed)] [NP two book(s)]
    [PP on this dining table]]]
115
Example: Miss Smith put two books on this dining
table.
  • Transfer
  • (1) Lexical Transfer: Miss → ??, Smith → ???,
    put (ed) → ?, two → ?, book (s) → ?, on → ???,
    this → ?, dining table → ??

116
Example: Miss Smith put two books on this dining
table.
  • Transfer
  • (2) Phrasal/Structural Transfer
    ?????????????? ??????????????

117
Example: Miss Smith put two books on this dining
table.
  • Generation: Morphological and Structural
  • ?????????????? ???????(?)???(?)????
  • ?????(?)?(?)????(?)????

119
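position := initial + rate * 60
  → lexical analyzer →
id1 := id2 + id3 * 60
  → syntax analyzer →
[syntax tree: := with children id1 and +; + with children id2 and *; * with children id3 and 60]
  → semantic analyzer →
[syntax tree: as above, with the leaf 60 replaced by inttoreal(60)]
SYMBOL TABLE: 1 position, 2 initial, 3 rate, 4 (unused)
(Aho 86)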
120
  → intermediate code generator →
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
  → code optimizer →
temp1 := id3 * 60.0
id1 := id2 + temp1
  → code generator →
(Aho 86)
Binary Code
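
The intermediate code generation step of this pipeline can be sketched as a post-order walk over the syntax tree that emits one temporary per operation. This is a minimal Python illustration under assumed conventions (nested tuples for tree nodes, the hypothetical function name `gen_three_address`), not the algorithm from Aho 86 verbatim.

```python
from itertools import count

def gen_three_address(target, expr):
    """Emit three-address code for `target := expr`.

    expr is a leaf string, ("inttoreal", e), or (op, left, right).
    """
    code, temps = [], count(1)

    def emit(rhs):
        name = f"temp{next(temps)}"
        code.append(f"{name} := {rhs}")
        return name

    def walk(node):
        if isinstance(node, str):          # identifier or constant
            return node
        if node[0] == "inttoreal":         # type conversion inserted by semantic analysis
            return emit(f"inttoreal({walk(node[1])})")
        op, left, right = node
        l = walk(left)
        r = walk(right)
        return emit(f"{l} {op} {r}")

    result = walk(expr)
    code.append(f"{target} := {result}")
    return code

tree = ("+", "id2", ("*", "id3", ("inttoreal", "60")))
for line in gen_three_address("id1", tree):
    print(line)
```

For the tree shown, the loop prints four lines: the conversion of 60, the multiplication, the addition, and the final assignment to id1. A later optimizer pass could fold the conversion into the constant 60.0 and drop the extra copy.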
121
Detailed Steps (1): Analysis
  • Text Pre-processing (separating texts from tags)
  • Clean up garbage patterns (usually introduced
    during file conversion)
  • Recover sentences and words (e.g., <B>C</B>omputer)
  • Separate Processing-Regions from
    Non-Processing-Regions (e.g., File-Header-Sections,
    Equations, etc.)
  • Extract and mark strings that need special
    treatment (e.g., Topics, Keywords, etc.)
  • Identify and convert markup tags into internal
    tags (de-markup; however, markup tags also
    provide information)
  • Discourse and Sentence Segmentation
  • Divide text into various primary processing units
    (e.g., sentences)
  • Discourse: cue phrases
  • Sentence: mainly classify the type of period
    and carriage return in English (sentence
    stops vs. abbreviations/titles)

122
Detailed Steps (2): Analysis (Cont.)
  • Stemming
  • English: perform morphological analysis (e.g.,
    -ed, -ing, -s, -ly, re-, pre-, etc.) and identify
    the root form (e.g., got <get>, lay <lie/lay>, etc.)
  • Chinese: mainly detect suffix lexemes (e.g., ???,
    ???, etc.)
  • Text normalization: capitalization, hyphenation, ...
  • Tokenization
  • English: mainly identify split idioms (e.g., turn
    NP on) and compounds
  • Chinese: word segmentation (e.g., ?? ?? ??)
  • Regular expressions: numerical strings/expressions
    (e.g., twenty million), dates, ... (each being
    associated with a specific type)
  • Tagging
  • Assign Part-of-Speech (e.g., n, v, adj, adv,
    etc.)
  • Associated forms are basically independent of
    languages starting from this step

123
Detailed Steps (3): Analysis (Cont.)
  • Parsing
  • Decide suitable syntactic relationship (e.g.,
    PP-Attachment)
  • Decide Word-Sense
  • Decide appropriate lexicon-sense (e.g.,
    River-Bank, Money-Bank, etc.)
  • Assign Case-Label
  • Decide suitable semantic relationship (e.g.,
    Patient, Agent, etc.)
  • Anaphora and Antecedent Resolution
  • Pronoun reference (e.g., "he" refers to "the
    president")

124
Detailed Steps (4): Analysis (Cont.)
  • Decide Discourse Structure
  • Decide suitable relationships between discourse
    segments (e.g., Evidence, Concession, Justification,
    etc. [Marcu 2000])
  • Convert into Logical Form (Optional)
  • Co-reference resolution (e.g., president refers
    to Bill Clinton), scope resolution (e.g.,
    negation), Temporal Resolution (e.g., today, last
    Friday), Spatial Resolution (e.g., here, next),
    etc.
  • Identify roles of Named-Entities (Person,
    Location, Organization), and determine IS-A (also
    Part-of) relationship, etc.
  • Mainly used in inference related applications
    (e.g., QA, etc.)

125
Detailed Steps (5): Transfer
  • Decide suitable Target Discourse Structure
  • For example: Evidence, Concession, Justification,
    etc. [Marcu 2000]
  • Decide suitable Target Lexicon Senses
  • Sense mapping may not be one-to-one (sense
    resolution might differ across
    languages, e.g., snow has more senses in Eskimo)
  • Sense-token mapping may not be one-to-one
    (lexicon representation power might differ
    across languages, e.g., DINK, ?, etc.).
    It could be 2-1, 1-2, etc.
  • Decide suitable Target Sentence Structure
  • For example: verb nominalization, constituent
    promotion and demotion (usually occurs when
    sense-token mapping is not 1-1)
  • Decide appropriate Target Case
  • Case label might change after the structure has
    been modified
  • (Example) verb nominalization: "that you
    (AGENT) invite me" → "your (POSS) invitation"

126
Detailed Steps (6): Generation
  • Adopt suitable Sentence Syntactic Pattern
  • Depends on style (i.e., the distribution of
    lexicon selections and syntactic patterns adopted)
  • Adopt suitable Target Lexicon
  • Select from the synonym set (depends on style)
  • Add "de" (Chinese), commas, tense, measure words
    (Chinese), etc.
  • Morphological generation is required for
    target-specific tokens
  • Text Post-processing
  • Final string substitution (replace the markers
    of special strings)
  • Extract and export associated information (e.g.,
    Glossary, Index, etc.)
  • Restore the customer's markup tags (re-markup) to
    save typesetting work