Title: Compiling for EDGE Architectures: The TRIPS Prototype Compiler
Slide 1: Compiling for EDGE Architectures: The TRIPS Prototype Compiler
Kathryn McKinley, Doug Burger, Steve Keckler, Jim Burrill (1), Xia Chen, Katie Coons, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder, et al.
The University of Texas at Austin; (1) University of Massachusetts, Amherst
Slide 2: Technology Scaling Hitting the Wall
(figure, qualitative and analytical views: a 20 mm chip edge and the 130 nm, 100 nm, 70 nm, and 35 nm technology nodes)
Either way, partitioning for on-chip communication is key.
Slide 3: Out-of-Order Superscalars Out of Steam
- The clock ride is over
  - Wire and pipeline limits
  - Quadratic out-of-order issue logic
  - Power, a first-order constraint
- Problems for any architectural solution
  - ILP: instruction-level parallelism
  - Memory and on-chip latency
- Major vendors ending processor lines
What's next?
Slide 5: Post-RISC Solutions
- CMP: an evolutionary path
  - Replicate what we already have 2 to N times on a chip
  - Coarse-grain parallelism
  - Exposes the resources to the programmer and compiler
- Explicit Data Graph Execution (EDGE)
  1. The program graph is broken into a sequence of blocks
     - Blocks commit atomically or not at all; a block never partially commits
  2. Dataflow within a block, with ISA support for direct producer-consumer communication
     - No shared named registers (point-to-point dataflow edges only)
     - Memory is still a shared namespace
- The block's dataflow graph (DFG) is explicit in the architecture
Slide 6: Outline
- TRIPS Execution Model and ISA
- TRIPS Architectural Constraints
- Compiler Structure
- Spatial Path Scheduling
Slide 7: Block-Atomic Execution Model
(figure: a TRIPS block shown as a flow graph, as a dataflow graph, and mapped onto the execution substrate of ALUs, register file, and data caches)
- TRIPS block: a single-entry constrained hyperblock
- Dataflow execution with target position encoding
Slide 8: TRIPS Block Constraints
- Fixed size: 128 instructions
  - Padded with no-ops if needed
- Load/store identifiers: 32 load or store queue identifiers
  - More than 32 static loads and stores is possible
- Registers: 32 reads and 32 writes, 8 to each of 4 register banks (in addition to the 128 instructions)
- Constant output: all stores and register writes execute, plus one terminating branch
  - Simplifies the hardware logic for detecting block completion
  - Every path of execution through a block must produce the same stores and register writes
(figure: a 1-128 instruction DFG with 32 register reads, 32 register writes, 32 loads, 32 stores, and a terminating branch producing the next PC)
Simplifies the hardware; more work for the compiler (a sketch of these checks follows below).
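The constraints above are what the backend phases must ultimately guarantee on every block. A minimal sketch of a block-legality check, assuming a toy instruction record and a simple modulo register-bank assignment (both hypothetical, not the Scale/TRIPS compiler's actual data structures):

```python
# Hypothetical sketch of the block constraints listed above; the Instruction
# record, register naming, and bank assignment are illustrative assumptions.
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class Instruction:
    opcode: str
    reads: list = field(default_factory=list)    # architectural registers read, e.g. "g1"
    writes: list = field(default_factory=list)   # architectural registers written
    is_load: bool = False
    is_store: bool = False
    is_branch: bool = False

def block_is_legal(instrs, num_banks=4, per_bank=8):
    if len(instrs) > 128:                                    # fixed block size
        return False
    # Simplified: count static loads/stores against the 32 LSQ identifiers.
    # (The real constraint is on assigned IDs; loads/stores on disjoint
    # predicate paths may share an identifier.)
    if sum(i.is_load or i.is_store for i in instrs) > 32:
        return False
    reads = {r for i in instrs for r in i.reads}
    writes = {w for i in instrs for w in i.writes}
    if len(reads) > 32 or len(writes) > 32:                  # 32 reads, 32 writes
        return False
    bank = lambda regs: Counter(int(r.lstrip("g")) % num_banks for r in regs)
    if any(n > per_bank for n in bank(reads).values()):      # 8 reads per bank
        return False
    if any(n > per_bank for n in bank(writes).values()):     # 8 writes per bank
        return False
    # Constant output (simplified): the block must end in a branch; checking
    # that every predicate path produces the same outputs is omitted here.
    return any(i.is_branch for i in instrs)
```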
Slide 9: Compiler Phases (Classic)
Scale Compiler (UT Austin / UMass)
- Frontend: C and FORTRAN
- Inlining, unrolling/flattening, scalar optimizations: PRE, global value numbering, scalar replacement, global variable replacement, SCC, copy propagation, array access strength reduction, LICM, tree height reduction, useless copy removal, dead variable elimination
- Code generation targets: Alpha, SPARC, PPC, TRIPS TIL
- TIL (TRIPS Intermediate Language): RISC-like three-address form
- TASL (TRIPS Assembly Language): dataflow target form with locations encoded
Slide 10: Backend Compiler Flow (TIL in, TASL out)
- Hyperblock formation: if-conversion, loop peeling, while-loop unrolling, instruction merging, predicate optimizations
- Resource allocation: register allocation, reverse if-conversion and splitting, load/store ID assignment, SSA for constant outputs
- Scheduling: fanout insertion, instruction placement, target form generation
Slide 11: Correctness: Progressively Satisfy Constraints
Constraints: 128 instructions; 32 load/store IDs; 32 register reads/writes (8 per each of 4 banks); constant output
The backend flow of slide 10 (hyperblock formation, resource allocation, scheduling) progressively satisfies these constraints.
Slide 12: Predication and Hyperblock Formation
- Predication (a toy if-conversion sketch follows below)
  - Converts control dependence to data dependence
  - Improves instruction fetch bandwidth
  - Eliminates branch mispredictions
  - Adds overhead
  - Any instruction can have a predicate, but in practice the predicate goes on the head of a dependence chain (low power) or its bottom (speculative)
- Hyperblock
  - Scheduling region (set of basic blocks)
  - Single entry, multiple exits, predicated instructions
  - Exposes parallelism without oversaturating resources
  - Must satisfy the block constraints
(figure: predicating at the head vs. at the bottom of a dependence chain)
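As a toy illustration of if-conversion (not the Scale/TRIPS IR or its predicate encoding), the branch condition becomes a predicate and both arms of the diamond are merged into one predicated block:

```python
# Toy if-conversion: blocks are lists of (opcode, dst, srcs) tuples; the merged
# hyperblock carries a (predicate, value) field on each predicated instruction.
def if_convert(cond, then_block, else_block, join_block):
    hyperblock = [("test", "p0", cond, None)]                 # compute the predicate
    hyperblock += [(op, d, s, ("p0", True)) for op, d, s in then_block]
    hyperblock += [(op, d, s, ("p0", False)) for op, d, s in else_block]
    hyperblock += [(op, d, s, None) for op, d, s in join_block]
    return hyperblock

# if (a < b) x = a + 1; else x = b + 1;  y = x * 2;
for instr in if_convert(("lt", "a", "b"),
                        [("addi", "x", ("a", 1))],
                        [("addi", "x", ("b", 1))],
                        [("muli", "y", ("x", 2))]):
    print(instr)
```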
Slide 13: Accuracy?
(repeats the constraint list and backend flow of slide 11)
Slide 14: Block-Atomic Execution Model (recap of slide 7)
- TRIPS block: a single-entry constrained hyperblock
- Dataflow execution with target position encoding
Slides 15-17: Spatial Scheduling Problem
- Partitioned microarchitecture
- Anchor points
- Balance latency and concurrency
Slide 18: Outline
- Background
- Spatial Path Scheduling
- Simulated Annealing
- Extending SPS
- Conclusions and Future Work
Slide 19: Dissecting the Problem
- Scheduling has two components:
  - Placement: where an instruction executes
  - Issue: when an instruction executes
(figure: EDGE positioned on the placement/issue axes: placement is decided statically by the compiler, issue happens dynamically in dataflow order)
Slide 20: Explicit Data Graph Execution
- Block-atomic execution
  - Instruction groups fetch, execute, and commit atomically
- Direct instruction communication
  - Explicitly encode the dataflow graph by specifying targets
RISC (operands name registers in a centralized register file):
  add r1, r4, r5
  add r2, r5, r6
  add r3, r1, r2
EDGE (each instruction names the instructions that consume its result):
  i1: add, target i3
  i2: add, target i3
  i3: add, target i4
Slide 21: Scheduling for TRIPS
- TRIPS ISA
  - Up to 128 instructions per block
  - Any instruction can be in any slot
- TRIPS microarchitecture
  - Up to 8 blocks in flight
  - 1-cycle latency between adjacent ALUs
- Known
  - Execution latencies
  - Lower bound for communication latency
- Unknown (estimated)
  - Memory access latencies
  - Resource conflicts
(figure: register file, ALU grid, and data cache)
Slide 23: Greedy Scheduling for TRIPS
- GRST [PACT 2004]: based on VLIW list scheduling
- Augmented with five heuristics:
  - Prioritizes the critical path (C)
  - Reprioritizes after each placement (R)
  - Accounts for data cache locality (L)
  - Accounts for register output locality (O)
  - Load balancing for local issue contention (B)
- Drawbacks
  - Unnecessary restrictions on scheduling order
  - Inelegant and overly specific
- Goal: replace the heuristics with an elegant approach designed for spatial scheduling
Slide 25: Outline (repeats slide 18; next section: Spatial Path Scheduling)
Slides 26-28: Spatial Path Scheduling Overview
The scheduler takes two inputs, the block's dataflow graph and the substrate topology (register banks R1/R2, data cache ports D0/D1, control, and the ALU grid), and produces a placement.
(figure: an example DFG of loads, multiplies, and an add being mapped onto the topology)
Slide 29: Spatial Path Scheduling Overview
- Initialize all known anchor points
- Until all instructions are scheduled:
  - Populate the open list
  - Find placement costs
  - Choose the minimum-cost location for each open instruction
  - Schedule the instruction whose minimum placement cost is largest (choose the max of the mins)
Slide 30: Spatial Path Scheduling Example
- Initialize all known anchor points
(figure: the register file and data cache banks around the ALU grid serve as anchor points)
Slide 31: Spatial Path Scheduling Example
- Populate the open list (marked in yellow in the figure)
Open list: instructions that are candidates for scheduling. We include instructions with no parents, or with at least one placed parent (a small sketch follows).
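A small sketch of that rule, assuming the DFG is available as a map from each instruction to its parents (an illustrative representation, not the scheduler's real one):

```python
# Open-list population as described above: candidates have no parents, or at
# least one parent that has already been placed.
def open_list(parents, placed):
    return [i for i, ps in parents.items()
            if i not in placed and (not ps or any(p in placed for p in ps))]

# Example: two register reads feed a mul; a load feeds an add.
dfg_parents = {"read1": [], "read2": [], "mul": ["read1", "read2"],
               "ld": ["read2"], "add": ["mul", "ld"]}
print(open_list(dfg_parents, placed=set()))         # ['read1', 'read2']
print(open_list(dfg_parents, placed={"read1"}))     # ['read2', 'mul']
```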
Slide 32: Spatial Path Scheduling Example
- Calculate the placement cost for each instruction in the open list at each slot
Placement cost(i, slot) = longest path length through i if i is placed at slot
cost = inputCost + execCost + outputCost (includes communication and execution latencies; sketched below)
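A hedged sketch of this cost computation, with a Manhattan-distance routing model and simplified latencies standing in for the real operand network and latency tables (all illustrative assumptions):

```python
# Simplified placement cost = inputCost + execCost + outputCost, as above.
def manhattan_hops(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])   # 1 cycle per hop between adjacent ALUs

def placement_cost(slot, placed_parents, exec_latency, output_anchor, output_path_len):
    """Longest path length through the instruction if placed at `slot`.

    placed_parents:  list of (parent_slot, parent_ready_cycle)
    output_anchor:   location of the nearest known consumer/anchor (e.g. a register bank)
    output_path_len: remaining DFG path length below this instruction
    """
    input_cost = max((ready + manhattan_hops(pslot, slot)
                      for pslot, ready in placed_parents), default=0)
    output_cost = manhattan_hops(slot, output_anchor) + output_path_len
    return input_cost + exec_latency + output_cost

# A 3-cycle mul whose parent result is ready at cycle 15 one hop away, writing
# toward an anchor 3 hops away: roughly reproduces the 16 + 3 + 3 = 22 total of
# the following example slide.
print(placement_cost((1, 1), [((0, 1), 15)], 3, (2, 3), 0))   # 22
```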
Slide 33: Spatial Path Scheduling Example
- Calculate the placement cost for each instruction in the open list at each slot
(figure: the mul placed at ALU E1; operands arrive over the operand network at 1 cycle per hop, with path latencies of 1, 3, and 5 cycles from the register file, data cache, and control anchors)
Total placement cost = 16 + 3 + 3 = 22 (input path + execution + output path)
Slide 34: Spatial Path Scheduling Example
- Calculate the placement cost for each instruction in the open list at each slot
Placement-cost grids over the 4x4 ALU array (register file and data cache anchors at the edges of the figure), one grid per open instruction:

mul (grid 1):
24 24 22 24
22 22 22 24
24 24 24 28
26 26 26 28

add (grid 2):
10  8  8 10
10 10 10 12
12 12 12 14
14 14 14 16

mul (grid 3):
22 22 24 26
22 22 24 26
24 24 26 28
26 26 28 30

add (grid 4):
22 22 24 26
22 22 24 26
24 24 26 28
26 26 28 30
Slide 35: Spatial Path Scheduling Example
- Choose the minimum-cost location for each instruction (same cost grids as slide 34)
Slide 36: Spatial Path Scheduling Example
- Break ties; example tie-breaking heuristics: links consumed, ALU utilization
(same cost grids as slide 34)
Slide 37: Spatial Path Scheduling Example
- Place the instruction with the highest minimum cost (choose the max of the mins)
(figure: the chosen mul is placed on the ALU grid near the D0 cache port; cost grids as on slide 34)
Slide 38: Spatial Path Scheduling Algorithm
Schedule(block, topology):
  initialize known anchor points
  while not all instructions are scheduled:
    for each instruction i in the open list:
      for each available location n:
        calculate the placement cost for (i, n)
        keep track of the n with the minimum placement cost
      keep track of the i with the highest minimum placement cost
    schedule the i with the highest minimum placement cost
Per-block complexity (i = number of instructions, n = number of ALUs):
- SPS: O(i² · n)
- GRST: O(i² + i · n)
- Exhaustive search: O(i!)
A runnable sketch of this loop appears below.
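The sketch below follows the loop above under heavy simplifications (toy DFG, one instruction per slot, unit hop latency, no contention, output-path cost omitted); the data structures and the 4x4 grid are illustrative, not the TRIPS scheduler's:

```python
# Minimal SPS-style loop: place the open instruction whose best (minimum-cost)
# slot is still the most expensive ("max of the mins").
from itertools import product

GRID = list(product(range(4), range(4)))           # (row, col) ALU slots
def hops(a, b): return abs(a[0]-b[0]) + abs(a[1]-b[1])

def sps_schedule(dfg, anchors, exec_lat):
    """dfg: instr -> list of parent instrs; anchors: instr -> fixed location."""
    placement = dict(anchors)                      # anchor points are pre-placed
    ready = {i: 0 for i in anchors}                # cycle each placed result is ready
    unscheduled = [i for i in dfg if i not in placement]
    while unscheduled:
        open_list = [i for i in unscheduled
                     if not dfg[i] or any(p in placement for p in dfg[i])]
        best = {}                                  # instr -> (min cost, best slot)
        for i in open_list:
            costs = []
            for slot in GRID:
                if slot in placement.values():     # one instruction per slot (simplified)
                    continue
                inp = max((ready[p] + hops(placement[p], slot)
                           for p in dfg[i] if p in placement), default=0)
                costs.append((inp + exec_lat.get(i, 1), slot))
            best[i] = min(costs)
        # max of the mins: place the instruction whose best slot is still costly
        i = max(best, key=lambda k: best[k][0])
        cost, slot = best[i]
        placement[i], ready[i] = slot, cost
        unscheduled.remove(i)
    return placement

anchors = {"read1": (0, 0), "read2": (0, 3)}       # register-bank anchor points (toy)
dfg = {"read1": [], "read2": [], "ld": ["read2"],
       "mul": ["read1", "read2"], "add": ["mul", "ld"]}
print(sps_schedule(dfg, anchors, {"ld": 5, "mul": 3, "add": 1}))
```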
Slide 39: SPS Benefits and Limitations
- Benefits
  - Automatically exploits known communication latencies
  - Designed for spatial scheduling
  - Minimizes critical path length at each step
  - Naturally encompasses four of the five GRST heuristics
- Limitations of the basic algorithm
  - Does not account for resource contention
  - Uses no global information
  - Minimum communication latencies may be optimistic
Slide 40: Experimental Methodology
- 26 hand-optimized microbenchmarks
  - Extracted from SPEC2000, EEMBC, Livermore Loops, MediaBench, and C libraries
  - Average dynamic instructions fetched per block: 67.3 (ranging from 14.5 to 117.5)
- Cycle-accurate simulator
  - Within 4% of RTL on average
  - Models communication and contention delays
- Comparison points
  - Greedy Scheduling for TRIPS (GRST)
  - Simulated annealing
Slides 41-43: SPS Performance
Geometric mean of speedup over GRST: 1.19 (Basic SPS)
(figure: per-benchmark speedups of Basic SPS over GRST)
Slide 44: Outline (repeats slide 18; next section: Simulated Annealing)
Slide 45: How Well Can We Do?
- Simulated annealing (a generic sketch follows below)
  - Artificial-intelligence search technique
  - Uses random perturbations to avoid local optima
  - Approximates a global optimum
- Cost function: simulated cycles
  - Uncertainty makes static cost functions insufficient
  - The best available cost function
- Purpose
  - Optimization
  - Discover a performance upper bound
  - A tool to improve the scheduler
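A generic annealer over placements, sketched under the assumption that the cost callback returns the simulated cycle count of a candidate placement (the slide's cost function); the swap perturbation and cooling schedule are illustrative choices, not necessarily those used for TRIPS:

```python
# Simulated annealing over instruction placements; cost(placement) stands in
# for the simulated cycle count.
import math, random

def anneal(initial_placement, cost, steps=10_000, t0=10.0, alpha=0.999):
    current, best = dict(initial_placement), dict(initial_placement)
    c_cur = c_best = cost(current)
    t = t0
    for _ in range(steps):
        cand = dict(current)
        a, b = random.sample(list(cand), 2)        # random perturbation: swap two slots
        cand[a], cand[b] = cand[b], cand[a]
        c_cand = cost(cand)
        # accept improvements always; accept regressions with probability e^(-delta/T)
        if c_cand <= c_cur or random.random() < math.exp((c_cur - c_cand) / t):
            current, c_cur = cand, c_cand
            if c_cur < c_best:
                best, c_best = dict(current), c_cur
        t *= alpha                                  # cool the temperature
    return best, c_best
```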
Slides 46-48: Speedup with Simulated Annealing
Geometric mean of speedup over GRST: Basic SPS 1.19; Annealed 1.40
(figure: per-benchmark speedups for Basic SPS and annealed schedules)
Slide 49: Outline (repeats slide 18; next section: Extending SPS)
Slide 50: Extending SPS
- Contention
  - Network link contention
  - Local and global ALU contention
- Global register prioritization
- Path volume scheduling
Slide 51: ALU Contention
- What if two instructions are ready to execute on the same ALU at the same time?
(figure: a DFG of register reads, loads, muls, adds, a branch, and a register write mapped onto the substrate, with two instructions contending for one ALU)
Slide 52: Local vs. Global ALU Contention
- Local ALU contention (see the sketch below)
  - Keep track of expected issue times
  - Increase the placement cost if a conflict occurs
- Global ALU contention
  - Consider resource utilization in the previous/next block
  - Apply a weighting function to modify the placement cost
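A sketch of the local contention adjustment, assuming each ALU slot tracks the expected issue cycles of its already-placed instructions; the penalty value and data layout are illustrative:

```python
# Local ALU contention: a candidate placement that collides with an existing
# instruction on the same issue cycle pays a cost penalty.
from collections import defaultdict

issue_times = defaultdict(set)          # slot -> set of expected issue cycles

def contention_adjusted_cost(slot, base_cost, expected_issue, penalty=1):
    """base_cost: the SPS placement cost; expected_issue: cycle the candidate
    instruction would be ready to issue on this ALU."""
    extra = penalty if expected_issue in issue_times[slot] else 0
    return base_cost + extra

def commit_placement(slot, expected_issue):
    issue_times[slot].add(expected_issue)

# Two instructions both ready at cycle 6 on ALU (1, 1): the second pays a penalty.
commit_placement((1, 1), 6)
print(contention_adjusted_cost((1, 1), base_cost=22, expected_issue=6))  # 23
print(contention_adjusted_cost((1, 2), base_cost=22, expected_issue=6))  # 22
```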
Slides 53-55: Speedup over GRST
Geometric mean of speedup over GRST: Basic SPS 1.19; SPS extended 1.31; Annealed 1.40
(figure: per-benchmark speedups for Basic SPS, extended SPS, and annealed schedules)
Slide 56: Related Work
- Scheduling for VLIW: Ellis, Fisher
- Scheduling for other partitioned architectures
  - Partitioned VLIW: Gilbert, Kailas, Kessler, Özer, Qian, Zalamea
  - RAW: Lee
  - WaveScalar: Mercaldi
- ASIC and FPGA place and route: Paulin
  - Resource conflicts known statically
  - Substrate may not be fixed
  - Simulated annealing: Betz
Slide 57: Conclusions and Future Work
- Future work
  - Register allocation
  - Memory placement
  - Reliability-aware scheduling
- Conclusions
  - A general spatial instruction scheduling algorithm
  - Reasons explicitly about anchor points
  - Performance within 4% of annealed results
Slide 58: Questions?
Slide 59: Mapping Instructions to Physical Locations
- The scheduler converts operand format to target format, and assigns IDs
- The ID assigned to each instruction indicates a physical location
- The microarchitecture can interpret this ID in many different ways
- To schedule well, the scheduler must understand how the microarchitecture translates ID -> physical location

TIL (operand format):
  read t0, g1
  read t1, g2
  muli t2, t1, 4
  ld t3, 0(t2)
  ld t4, 4(t2)
  mul t5, t3, t4
  add t6, t5, t0
  addi t7, t1, 8
  br t7
  write g1, t6

The scheduler produces TASL (target format, locations encoded):
  R1: read, G1, N5
  R2: read, N2, N6
  N2: muli, N34, N1
  N34: ld, N32
  N1: ld, N32
  N32: mul, N5
  N5: add, W1
  N6: addi, N0
  N0: br
  W1: write, G1
Slides 60-62: Mapping Instructions to Physical Locations (continued)
(figure: the TASL code from slide 59 shown mapped onto the substrate: register read/write banks R1, R2, W1, data cache ports D0-D3, control, and the ALU grid)
Slide 63: Simulated Annealing Over Time
Slide 64: Simulated Annealing
- Cost function: simulated cycles
- Prune the search space further with a critical path tool
(figure: guided vs. unguided annealing for memset_hand)
Slide 65: Contention
- ALU contention
  - Local (within a block): estimate a temporal schedule
  - Global (between blocks): probabilistic; use a weighting function
- Network link contention
  - Precise measurements are too inaccurate
  - Estimate with a threshold and a weighting function
- Weight network link and global ALU contention based on annealed results:
  weight = (1 - fullness) * (1 - criticality / concurrency)
Slide 66: Global Register Prioritization
- Problem: with speculative execution, any register dependence may be important
- Solution: extend path lengths through registers
  - Register prioritization: schedule smaller loops before larger loops, and schedule loop-carried dependences first
  - Extend the placement cost through registers to the previous/next block
Slide 67: Path Volume Scheduling
- Problem: the basic SPS algorithm does not account for the number of instructions in the path
- Solution: perform a depth-first search with iterative deepening to find the shortest path through the substrate that can hold all of the path's instructions (a sketch follows below)
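A hedged sketch of that search on a 4x4 grid, where `volume` is the number of instructions the path must hold; the grid model and interfaces are illustrative assumptions, not the TRIPS scheduler's:

```python
# Iterative-deepening DFS: find the shortest simple route between two placed
# endpoints that visits at least `volume` distinct slots (one per instruction).
def neighbors(slot, size=4):
    r, c = slot
    return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < size and 0 <= c + dc < size]

def path_of_volume(src, dst, volume):
    def dfs(slot, visited, remaining):
        if slot == dst:
            return visited if len(visited) >= volume else None
        if remaining == 0:
            return None
        for nxt in neighbors(slot):
            if nxt not in visited:
                found = dfs(nxt, visited + [nxt], remaining - 1)
                if found:
                    return found
        return None
    for depth in range(volume - 1, 16):          # deepen until a path fits
        path = dfs(src, [src], depth)
        if path:
            return path
    return None

# A 5-instruction chain between a register-bank anchor and a cache-port anchor:
print(path_of_volume((0, 0), (0, 3), volume=5))
```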