Processor Architectures and Program Mapping - PowerPoint PPT Presentation

1
Processor Architectures and Program Mapping
Exploiting ILP, part 2: code generation
  • TU/e 5kk10
  • Henk Corporaal
  • Jef van Meerbergen
  • Bart Mesman

2
Overview
  • Architecture methods to enhance performance
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering
  • Code generation
  • Hands-on

3
Compiler basics
  • Overview
  • Compiler trajectory / structure / passes
  • Control Flow Graph (CFG)
  • Mapping and Scheduling
  • Basic block list scheduling
  • Extended scheduling scope
  • Loop scheduling

4
Compiler basics: trajectory

Source program
  ↓
Preprocessor
  ↓
Compiler        → Error messages
  ↓
Assembler
  ↓
Loader/Linker   ← Library code
  ↓
Object program
5
Compiler basics: structure / passes

Source code
  ↓
Lexical analyzer          (token generation)
  ↓
Parsing                   (syntax check, semantic check, parse tree generation)
  ↓
Intermediate code
  ↓
Code optimization         (data flow analysis, local optimizations, global optimizations)
  ↓
Code generation           (code selection, peephole optimizations)
  ↓
Register allocation       (making interference graph, graph coloring, spill code insertion,
                           caller / callee save and restore code)
  ↓
Sequential code
  ↓
Scheduling and allocation (exploiting ILP)
  ↓
Object code
6
Compiler basics: structure, simple compilation example

Source: position := initial + rate * 60
  ↓ Lexical analyzer
id1 := id2 + id3 * 60
  ↓ Syntax analyzer
(parse tree)
  ↓ Intermediate code generator
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
  ↓ Code optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1
  ↓ Code generator
movf id3, r2
mulf #60.0, r2, r2
movf id2, r1
addf r2, r1
movf r1, id1
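As a sketch of the code-optimizer step, here is a toy pass over three-address code that folds inttoreal(60) into a constant and absorbs the final copy into its producer. The triple representation and the function name `optimize` are illustrative, not a real compiler's API.

```python
# Toy optimizer pass over three-address code (hypothetical IR representation).
def optimize(tac):
    """tac: list of (dest, op, args) triples; returns an optimized list."""
    value = {}   # temp -> folded constant
    out = []
    for dest, op, args in tac:
        args = tuple(value.get(a, a) for a in args)
        if op == "inttoreal" and isinstance(args[0], int):
            value[dest] = float(args[0])            # constant folding
        elif op == "copy" and out and out[-1][0] == args[0]:
            _, o, a = out.pop()                     # absorb "dest := temp" into its producer
            out.append((dest, o, a))
        else:
            out.append((dest, op, args))
    return out

tac = [("temp1", "inttoreal", (60,)),
       ("temp2", "*", ("id3", "temp1")),
       ("temp3", "+", ("id2", "temp2")),
       ("id1", "copy", ("temp3",))]
print(optimize(tac))   # -> [('temp2', '*', ('id3', 60.0)), ('id1', '+', ('id2', 'temp2'))]
```

The result matches the optimized code on the slide up to temporary renaming: one multiply by the folded constant 60.0 and one add writing id1.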
7
Compiler basics: Control flow graph (CFG)

C input code:
if (a > b) r = a % b; else r = b % a;

CFG:
1: sub t1, a, b
   bgz t1, 2, 3
2: rem r, a, b
   goto 4
3: rem r, b, a
   goto 4
4: ..
   ..

A Program is a collection of Functions, each Function is a collection of Basic Blocks, each BB contains a set of Instructions, and each Instruction consists of several Transports, ...
8
Mapping / Scheduling: placing operations in space and time

  • d = a * b
  • e = a + d
  • f = 2*b + d
  • r = f - e
  • x = z + y

[Figure: the five operations drawn as a Data Dependence Graph (DDG), with inputs a, b, 2, z, y and outputs r, x]
9
How to map these operations?
  • Architecture constraints
  • One Function Unit
  • All operations single cycle latency

[Figure: the DDG operations placed one per cycle on the single FU, cycles 1-6]
10
How to map these operations?
  • Architecture constraints
  • One Add-sub and one Mul unit
  • All operations single cycle latency

[Figure: the same operations scheduled on the Mul and Add-sub units in parallel, cycles 1-6]
11
There are many mapping solutions
12
Basic Block Scheduling
  • Make a dependence graph
  • Determine minimal length
  • Determine ASAP, ALAP, and slack of each operation
  • Place each operation in the first cycle with
    sufficient resources
  • Note
  • Scheduling order is sequential
  • Priority determined by the heuristic used, e.g. slack

13
Basic Block Scheduling

[Figure: a DDG with LD, ADD, SUB, NEG and MUL operations over inputs A, B, C and z, y, X, each operation annotated with its ⟨ASAP cycle, ALAP cycle⟩ pair: ⟨1,1⟩, ⟨2,2⟩, ⟨3,3⟩, ⟨1,3⟩, ⟨2,3⟩, ⟨4,4⟩, ⟨2,4⟩, ⟨1,4⟩; the slack of an operation is ALAP − ASAP]
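The ⟨ASAP, ALAP⟩ annotations can be computed directly from the DDG. A minimal Python sketch, assuming unit latency on every edge; the function name `asap_alap` and the small four-node example graph are hypothetical:

```python
# Sketch: ASAP, ALAP and slack for a DAG given as {node: [successors]}.
def asap_alap(succ):
    nodes = list(succ)
    pred = {v: [] for v in nodes}
    for u, vs in succ.items():
        for v in vs:
            pred[v].append(u)

    # ASAP: earliest cycle, i.e. longest path from any source (cycles start at 1).
    asap = {}
    def asap_of(v):
        if v not in asap:
            asap[v] = 1 + max((asap_of(u) for u in pred[v]), default=0)
        return asap[v]
    for v in nodes:
        asap_of(v)

    # ALAP: latest cycle that still meets the minimal schedule length.
    length = max(asap.values())
    alap = {}
    def alap_of(v):
        if v not in alap:
            alap[v] = min((alap_of(s) for s in succ[v]), default=length + 1) - 1
        return alap[v]
    for v in nodes:
        alap_of(v)

    slack = {v: alap[v] - asap[v] for v in nodes}
    return asap, alap, slack

# Hypothetical DDG: LD -> MUL -> ADD, with a NEG also feeding the ADD.
g = {"LD": ["MUL"], "MUL": ["ADD"], "NEG": ["ADD"], "ADD": []}
asap, alap, slack = asap_alap(g)
print(asap, alap, slack)
```

Here the NEG is the only operation off the critical path, so it is the only one with non-zero slack.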
14
Cycle-based list scheduling

proc Schedule(DDG = (V,E))
beginproc
  ready  := { v | ¬∃(u,v) ∈ E }
  sched  := ∅
  current_cycle := 0
  while sched ≠ V do
    for each v ∈ ready do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) := current_cycle
        sched := sched ∪ {v}
      endif
    endfor
    current_cycle := current_cycle + 1
    ready  := { v | v ∉ sched ∧ ∀(u,v) ∈ E : u ∈ sched }
    ready' := { v | v ∈ ready ∧ ∀(u,v) ∈ E : cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
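The pseudocode above can be sketched in Python. This is a minimal sketch, assuming a single-FU resource model in place of ResourceConfl, delay(u,v) = 1 on every edge, and the dependences of the d/e/f/r/x example from the earlier mapping slide:

```python
# Cycle-based list scheduler sketch (hypothetical single-FU resource model).
def list_schedule(nodes, edges, n_fus=1, delay=1):
    """edges: set of (u, v) precedence pairs. Returns {node: cycle}."""
    preds = {v: {u for (u, w) in edges if w == v} for v in nodes}
    cycle = {}
    current = 0
    while len(cycle) < len(nodes):
        # ready: unscheduled ops whose predecessors all finished early enough
        ready = [v for v in nodes if v not in cycle
                 and all(u in cycle and cycle[u] + delay <= current
                         for u in preds[v])]
        used = sum(1 for v in cycle if cycle[v] == current)
        for v in ready:
            if used < n_fus:            # resource-conflict check
                cycle[v] = current
                used += 1
        current += 1                    # advance to the next cycle
    return cycle

ops = ["d", "e", "f", "r", "x"]
deps = {("d", "e"), ("d", "f"), ("e", "r"), ("f", "r")}
sched = list_schedule(ops, deps, n_fus=1)
print(sched)   # one operation per cycle on the single FU
```

With one FU the dependence chain d, e, f, r fills cycles 0-3 and the independent x falls into cycle 4, matching the single-FU schedule length of five cycles.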
15
Extended basic block scheduling Code Motion
  • Downward code motions?
  • a → B, a → C, a → D, c → D, d → D
  • Upward code motions?
  • c → A, d → A, e → B, e → C, e → A

16
Extended Scheduling scope

Code:
A
if cond then B else C
D
if cond then E else F
G

[Figure: the corresponding CFG (Control Flow Graph) with blocks A, B, C, D, E, F, G]
17
Scheduling scopes
Trace Superblock Decision tree
Hyperblock/region
18
Code movement (upwards) within regions

[Figure: a region of blocks between a source block and a destination block, with intermediate blocks (I) in between; moving an add upwards past an intermediate block requires a copy in that block (legend: copy needed) and a check for off-liveness]
19
Extended basic block scheduling: Code Motion
  • A dominates B ⇔ A is always executed before B
  • Consequently
  • A does not dominate B → code motion from B to A
    requires code duplication
  • B post-dominates A ⇔ B is always executed after A
  • Consequently
  • B does not post-dominate A → code motion from B
    to A is speculative
Q1 does C dominate E? Q2 does C dominate D? Q3
does F post-dominate D? Q4 does D post-dominate
B?
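Questions like these can be answered mechanically with the standard iterative dominator computation. A minimal sketch, assuming the CFG of the Extended Scheduling scope slide (A branches to B and C, which join in D, and so on); post-dominators come from running the same computation on the reversed CFG:

```python
# Iterative dataflow sketch: dom(v) = {v} ∪ ⋂ dom(p) over all predecessors p.
def dominators(preds, entry):
    """preds: {node: [predecessors]}. Returns {node: set of dominators}."""
    nodes = set(preds) | {entry}
    dom = {v: set(nodes) for v in nodes}   # start from "everything dominates v"
    dom[entry] = {entry}
    changed = True
    while changed:                          # iterate to a fixpoint
        changed = False
        for v in nodes - {entry}:
            new = {v} | set.intersection(*(dom[p] for p in preds[v]))
            if new != dom[v]:
                dom[v] = new
                changed = True
    return dom

# CFG from the Extended Scheduling scope slide.
preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"],
         "E": ["D"], "F": ["D"], "G": ["E", "F"]}
dom = dominators(preds, "A")
print("A" in dom["G"], "C" in dom["D"])   # -> True False
```

So A dominates every block, while C does not dominate D (execution can reach D through B), which is exactly the case where downward motion into D needs duplication.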
20
Scheduling Loops
Loop Optimizations

[Figure: CFG transformations of a loop over blocks A, B, C, D. Loop peeling copies the loop body once in front of the loop; loop unrolling replicates the body (C, C, C) inside the loop]
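As a sketch, the two transformations can be written at source level. The loop and its body are hypothetical; in a compiler the transformation is applied to the CFG, not to source text:

```python
# Source-level illustration of loop peeling and loop unrolling.
def body(i, acc):
    return acc + i * i          # the loop body "C" (hypothetical)

def original(n):
    acc = 0
    for i in range(n):
        acc = body(i, acc)
    return acc

def peeled(n):
    acc = 0
    if n > 0:                   # first iteration peeled out of the loop
        acc = body(0, acc)
    for i in range(1, n):
        acc = body(i, acc)
    return acc

def unrolled(n):                # unroll factor 3; n assumed divisible by 3
    acc = 0
    for i in range(0, n, 3):
        acc = body(i, acc)      # three body copies per iteration expose
        acc = body(i + 1, acc)  # parallelism across the copies, at the
        acc = body(i + 2, acc)  # cost of code expansion
    return acc

print(original(9), peeled(9), unrolled(9))   # -> 204 204 204
```

All three variants compute the same result; unrolling only changes how many body copies the scheduler sees per loop iteration, which is exactly the limitation the next slide discusses.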
21
Scheduling Loops
  • Problems with unrolling
  • Exploits only parallelism within sets of n
    iterations
  • Iteration start-up latency
  • Code expansion

[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining]
22
Software pipelining
  • Software pipelining a loop is
  • Scheduling the loop such that iterations start
    before preceding iterations have finished
  • Or
  • Moving operations across the backedge

[Figure: LD-ML-ST schedules: without overlap 3 cycles/iteration; unrolling gives 5/3 cycles/iteration; software pipelining reaches 1 cycle/iteration]
23
Software pipelining (cont'd)
  • Basic techniques
  • Modulo scheduling (Rau, Lam)
  • list scheduling with modulo resource constraints
  • Kernel recognition techniques
  • unroll the loop
  • schedule the iterations
  • identify a repeating pattern
  • Examples
  • Perfect pipelining (Aiken and Nicolau)
  • URPR (Su, Ding and Xia)
  • Petri net pipelining (Allan)
  • Enhanced pipeline scheduling (Ebcioglu)
  • fill first cycle of iteration
  • copy this instruction over the backedge

24
Software pipelining Modulo scheduling
Example Modulo scheduling a loop
ld r1,(r2) mul r3,r1,3 sub r4,r3,1 st r4,(r5)
Prologue
ld r1,(r2) mul r3,r1,3 sub r4,r3,1 st r4,(r5)
ld r1,(r2) mul r3,r1,3 sub r4,r3,1 st r4,(r5)
Kernel
ld r1,(r2) mul r3,r1,3 sub r4,r3,1 st r4,(r5)
Epilogue
(c) Software pipeline
  • Prologue fills the SW pipeline with iterations
  • Epilogue drains the SW pipeline

25
Software pipelining: determine II, the Initiation Interval

Cyclic data dependences, e.g.:
for (i=0; ...) A[i+6] = 3*A[i-1];

cycle(v) ≥ cycle(u) + delay(u,v) − II · distance(u,v)
26
Modulo scheduling constraints
MII: minimum initiation interval, bounded by cyclic
dependences and resources:
MII = max(ResMII, RecMII)
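A minimal sketch of both bounds, with hypothetical numbers: 4 operations per iteration on 2 FUs, and the recurrence of the previous slide (distance 7, since A[i+6] written in iteration i is read as A[i-1] in iteration i+7) assuming a total delay of 3 along the cycle:

```python
# Sketch: lower bounds on the initiation interval (hypothetical numbers).
import math

def res_mii(op_count, n_fus):
    """Resource bound: operations per iteration / available FUs, rounded up."""
    return math.ceil(op_count / n_fus)

def rec_mii(cycles):
    """Recurrence bound: for each dependence cycle, ceil(total delay /
    total iteration distance); cycles = [(delay, distance), ...]."""
    return max(math.ceil(d / dist) for d, dist in cycles)

mii = max(res_mii(4, 2), rec_mii([(3, 7)]))
print(mii)   # -> 2
```

With these numbers the resource bound dominates: two FUs cannot start a 4-operation iteration more often than every 2 cycles, while the recurrence alone would allow II = 1.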
27
The Role of the Compiler
  • 9 steps are required to translate an HLL program:
  • Front-end compilation
  • Determine dependencies
  • Graph partitioning: make multiple threads (or
    tasks)
  • Bind partitions to compute nodes
  • Bind operands to locations
  • Bind operations to time slots: Scheduling
  • Bind operations to functional units
  • Bind transports to buses
  • Execute operations and perform transports

28
Division of responsibilities between hardware and
compiler

Application
  ↓
Frontend
  ↓
Determine Dependencies
  ↓
Binding of Operands
  ↓
Scheduling
  ↓
Binding of Operations
  ↓
Binding of Transports
  ↓
Execute

The compiler's responsibility ends, and the hardware's begins, at a different step for each architecture class:
  • Superscalar: after the Frontend (hardware determines dependencies onwards)
  • Dataflow: after Determine Dependencies
  • Multi-threaded: after Binding of Operands
  • Indep. Arch: after Scheduling
  • VLIW: after Binding of Operations
  • TTA: after Binding of Transports (hardware only executes)
29
Overview
  • Architecture methods to enhance performance
  • Instruction Level Parallelism
  • VLIW
  • Examples
  • C6
  • TM
  • TTA
  • Clustering
  • Code generation
  • Hands-on

30
Hands-on (not this year)
  • Map JPEG to a TTA processor
  • see web page: http://www.ics.ele.tue.nl/heco/courses/pam
  • Install TTA tools (compiler and simulator)
  • Go through all listed steps
  • Perform DSE: design space exploration
  • Add an SFU
  • 1 or 2 page report in 2 weeks

31
Hands-on
  • Let's look at DSE: Design Space Exploration
  • We will use the Imagine processor
  • http://cva.stanford.edu/projects/imagine/

32
Mapping applications to processors: the MOVE framework

[Figure: MOVE framework. User interaction and architecture parameters drive an optimizer; a parametric compiler produces parallel object code and a hardware generator produces the chip of the TTA based system; both paths return feedback to the optimizer]
33
Code generation trajectory for TTAs
  • Frontend
  • GCC or SUIF (adapted)

[Figure: Application (C) → compiler frontend → sequential code → sequential simulation (with input/output), which yields profiling data; the compiler backend combines the sequential code, the profiling data and an architecture description into parallel code, checked by parallel simulation (with input/output)]
34
Exploration: TTA resource reduction
35
Exploration: TTA connectivity reduction

[Figure: execution time vs number of connections removed; critical connections disappear, reducing bus delay, until the FU stage constrains the cycle time]
36
Can we do better?
Yes!!
  • How ?
  • Transformations
  • SFUs Special Function Units
  • Multiple Processors

37
Transforming the specification

Based on associativity of operation: a + (b + c) = (a + b) + c
38
Transforming the specification

d = a * b       →    r = 2*b - a
e = a + d            x = z + y
f = 2*b + d
r = f - e
x = z + y

[Figure: the transformed DDG computes r = (b << 1) - a and x = z + y]
39
Changing the architecture: adding SFUs (special function units)

4-input adder: why is this faster?
40
Changing the architecture: adding SFUs (special function units)
  • In the extreme case, put everything into one unit!

Spatial mapping: no control flow
However: no flexibility / programmability!!
41
SFUs: fine grain patterns
  • Why use fine grain SFUs?
  • Code size reduction
  • Register file ports reduction
  • Could be cheaper and/or faster
  • Transport reduction
  • Power reduction (avoid charging non-local wires)
  • Supports a whole application domain!
  • Which patterns need support?
  • Detection of recurring operation patterns needed

42
SFUs: covering results
43
Exploration: resulting architecture
  • Architecture for image processing
  • Note the reduced connectivity

44
Conclusions
  • Billions of embedded processing systems
  • how to design these systems quickly, cheaply,
    correctly, at low power, ... ?
  • what will their processing platform look like?
  • VLIWs are very powerful and flexible
  • can be easily tuned to the application domain
  • TTAs are even more flexible, scalable, and lower power

45
Conclusions
  • Compilation for ILP architectures is maturing, and
  • entering the commercial arena.
  • However
  • Great discrepancy between available and
    exploitable parallelism
  • Advanced code scheduling techniques needed to
    exploit ILP

46
Bottom line
Do not pay for hardware if
you can do it in software!!