Title: Compiling for EDGE Architectures: The TRIPS Prototype Compiler
Slide 1: Compiling for EDGE Architectures: The TRIPS Prototype Compiler
Kathryn McKinley, Doug Burger, Steve Keckler, Jim Burrill (1), Xia Chen, Katie Coons, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder, et al.
The University of Texas at Austin; (1) University of Massachusetts, Amherst
Slide 2: Technology Scaling Hitting the Wall
(figure, qualitative and analytical views: a 20 mm chip edge and the 130 nm, 100 nm, 70 nm, and 35 nm technology nodes)
Either way, partitioning for on-chip communication is key.
Slide 3: Out-of-Order Superscalars Out of Steam
- The clock ride is over
  - Wire and pipeline limits
  - Quadratic out-of-order issue logic
  - Power, a first-order constraint
- Problems for any architectural solution
  - ILP: instruction-level parallelism
  - Memory and on-chip latency
- Major vendors ending processor lines
What's next?
Slide 5: Post-RISC Solutions
- CMP: an evolutionary path
  - Replicate what we already have 2 to N times on a chip
  - Coarse-grain parallelism
  - Exposes the resources to the programmer and compiler
- Explicit Data Graph Execution (EDGE)
  1. The program graph is broken into a sequence of blocks
     - Blocks commit atomically or not at all; a block never partially commits
  2. Dataflow within a block, with ISA support for direct producer-consumer communication
     - No shared named registers (point-to-point dataflow edges only)
     - Memory is still a shared namespace
- The block's dataflow graph (DFG) is explicit in the architecture
Slide 6: Outline
- TRIPS Execution Model and ISA
- TRIPS Architectural Constraints
- Compiler Structure
- Spatial Path Scheduling
Slide 7: Block-Atomic Execution Model
(figure: a TRIPS block shown as a flow graph, as a dataflow graph, and mapped onto the execution substrate of ALUs, register file, and data caches)
- TRIPS block: a single-entry constrained hyperblock
- Dataflow execution with target position encoding
Slide 8: TRIPS Block Constraints
- Fixed size: 128 instructions
  - Padded with no-ops if needed
- Load/store identifiers: 32 load or store queue identifiers
  - More than 32 static loads and stores is possible
- Registers: 32 reads and 32 writes, 8 to each of 4 register banks (in addition to the 128 instructions)
- Constant output: all stores and register writes execute, plus one terminating branch
  - Simplifies the hardware logic for detecting block completion
  - Every path of execution through a block must produce the same stores and register writes
(figure: a 1-128 instruction DFG with 32 register reads, 32 register writes, 32 loads, 32 stores, and a terminating branch producing the next PC)
Simplifies the hardware; more work for the compiler (a sketch of these checks follows below).
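The constraints above are what the backend phases must ultimately guarantee on every block. A minimal sketch of a block-legality check, assuming a toy instruction record and a simple modulo register-bank assignment (both hypothetical, not the Scale/TRIPS compiler's actual data structures):

```python
# Hypothetical sketch of the block constraints listed above; the Instruction
# record, register naming, and bank assignment are illustrative assumptions.
from dataclasses import dataclass, field
from collections import Counter

@dataclass
class Instruction:
    opcode: str
    reads: list = field(default_factory=list)    # architectural registers read, e.g. "g1"
    writes: list = field(default_factory=list)   # architectural registers written
    is_load: bool = False
    is_store: bool = False
    is_branch: bool = False

def block_is_legal(instrs, num_banks=4, per_bank=8):
    if len(instrs) > 128:                                    # fixed block size
        return False
    # Simplified: count static loads/stores against the 32 LSQ identifiers.
    # (The real constraint is on assigned IDs; loads/stores on disjoint
    # predicate paths may share an identifier.)
    if sum(i.is_load or i.is_store for i in instrs) > 32:
        return False
    reads = {r for i in instrs for r in i.reads}
    writes = {w for i in instrs for w in i.writes}
    if len(reads) > 32 or len(writes) > 32:                  # 32 reads, 32 writes
        return False
    bank = lambda regs: Counter(int(r.lstrip("g")) % num_banks for r in regs)
    if any(n > per_bank for n in bank(reads).values()):      # 8 reads per bank
        return False
    if any(n > per_bank for n in bank(writes).values()):     # 8 writes per bank
        return False
    # Constant output (simplified): the block must end in a branch; checking
    # that every predicate path produces the same outputs is omitted here.
    return any(i.is_branch for i in instrs)
```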
Slide 9: Compiler Phases (Classic)
Scale Compiler (UT Austin / UMass)
- Frontend: C and FORTRAN
- Inlining, unrolling/flattening, scalar optimizations: PRE, global value numbering, scalar replacement, global variable replacement, SCC, copy propagation, array access strength reduction, LICM, tree height reduction, useless copy removal, dead variable elimination
- Code generation targets: Alpha, SPARC, PPC, TRIPS TIL
- TIL (TRIPS Intermediate Language): RISC-like three-address form
- TASL (TRIPS Assembly Language): dataflow target form with locations encoded
Slide 10: Backend Compiler Flow (TIL in, TASL out)
- Hyperblock formation: if-conversion, loop peeling, while-loop unrolling, instruction merging, predicate optimizations
- Resource allocation: register allocation, reverse if-conversion and splitting, load/store ID assignment, SSA for constant outputs
- Scheduling: fanout insertion, instruction placement, target form generation
Slide 11: Correctness: Progressively Satisfy Constraints
Constraints: 128 instructions; 32 load/store IDs; 32 register reads/writes (8 per each of 4 banks); constant output
The backend flow of slide 10 (hyperblock formation, resource allocation, scheduling) progressively satisfies these constraints.
Slide 12: Predication and Hyperblock Formation
- Predication (a toy if-conversion sketch follows below)
  - Converts control dependence to data dependence
  - Improves instruction fetch bandwidth
  - Eliminates branch mispredictions
  - Adds overhead
  - Any instruction can have a predicate, but in practice the predicate goes on the head of a dependence chain (low power) or its bottom (speculative)
- Hyperblock
  - Scheduling region (set of basic blocks)
  - Single entry, multiple exits, predicated instructions
  - Exposes parallelism without oversaturating resources
  - Must satisfy the block constraints
(figure: predicating at the head vs. at the bottom of a dependence chain)
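As a toy illustration of if-conversion (not the Scale/TRIPS IR or its predicate encoding), the branch condition becomes a predicate and both arms of the diamond are merged into one predicated block:

```python
# Toy if-conversion: blocks are lists of (opcode, dst, srcs) tuples; the merged
# hyperblock carries a (predicate, value) field on each predicated instruction.
def if_convert(cond, then_block, else_block, join_block):
    hyperblock = [("test", "p0", cond, None)]                 # compute the predicate
    hyperblock += [(op, d, s, ("p0", True)) for op, d, s in then_block]
    hyperblock += [(op, d, s, ("p0", False)) for op, d, s in else_block]
    hyperblock += [(op, d, s, None) for op, d, s in join_block]
    return hyperblock

# if (a < b) x = a + 1; else x = b + 1;  y = x * 2;
for instr in if_convert(("lt", "a", "b"),
                        [("addi", "x", ("a", 1))],
                        [("addi", "x", ("b", 1))],
                        [("muli", "y", ("x", 2))]):
    print(instr)
```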
Slide 13: Accuracy?
(repeats the constraint list and backend flow of slide 11)
Slide 14: Block-Atomic Execution Model (recap of slide 7)
- TRIPS block: a single-entry constrained hyperblock
- Dataflow execution with target position encoding
Slides 15-17: Spatial Scheduling Problem
- Partitioned microarchitecture
- Anchor points
- Balance latency and concurrency
Slide 18: Outline
- Background
- Spatial Path Scheduling
- Simulated Annealing
- Extending SPS
- Conclusions and Future Work
Slide 19: Dissecting the Problem
- Scheduling has two components:
  - Placement: where an instruction executes
  - Issue: when an instruction executes
(figure: EDGE positioned on the placement/issue axes: placement is decided statically by the compiler, issue happens dynamically in dataflow order)
Slide 20: Explicit Data Graph Execution
- Block-atomic execution
  - Instruction groups fetch, execute, and commit atomically
- Direct instruction communication
  - Explicitly encode the dataflow graph by specifying targets
RISC (operands name registers in a centralized register file):
  add r1, r4, r5
  add r2, r5, r6
  add r3, r1, r2
EDGE (each instruction names the instructions that consume its result):
  i1: add, target i3
  i2: add, target i3
  i3: add, target i4
Slide 21: Scheduling for TRIPS
- TRIPS ISA
  - Up to 128 instructions per block
  - Any instruction can be in any slot
- TRIPS microarchitecture
  - Up to 8 blocks in flight
  - 1-cycle latency between adjacent ALUs
- Known
  - Execution latencies
  - Lower bound for communication latency
- Unknown (estimated)
  - Memory access latencies
  - Resource conflicts
(figure: register file, ALU grid, and data cache)
Slide 23: Greedy Scheduling for TRIPS
- GRST [PACT 2004]: based on VLIW list scheduling
- Augmented with five heuristics:
  - Prioritizes the critical path (C)
  - Reprioritizes after each placement (R)
  - Accounts for data cache locality (L)
  - Accounts for register output locality (O)
  - Load balancing for local issue contention (B)
- Drawbacks
  - Unnecessary restrictions on scheduling order
  - Inelegant and overly specific
- Goal: replace the heuristics with an elegant approach designed for spatial scheduling
Slide 25: Outline (repeats slide 18; next section: Spatial Path Scheduling)
Slides 26-28: Spatial Path Scheduling Overview
The scheduler takes two inputs, the block's dataflow graph and the substrate topology (register banks R1/R2, data cache ports D0/D1, control, and the ALU grid), and produces a placement.
(figure: an example DFG of loads, multiplies, and an add being mapped onto the topology)
Slide 29: Spatial Path Scheduling Overview
- Initialize all known anchor points
- Until all instructions are scheduled:
  - Populate the open list
  - Find placement costs
  - Choose the minimum-cost location for each open instruction
  - Schedule the instruction whose minimum placement cost is largest (choose the max of the mins)
Slide 30: Spatial Path Scheduling Example
- Initialize all known anchor points
(figure: the register file and data cache banks around the ALU grid serve as anchor points)
Slide 31: Spatial Path Scheduling Example
- Populate the open list (marked in yellow in the figure)
Open list: instructions that are candidates for scheduling. We include instructions with no parents, or with at least one placed parent (a small sketch follows).
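A small sketch of that rule, assuming the DFG is available as a map from each instruction to its parents (an illustrative representation, not the scheduler's real one):

```python
# Open-list population as described above: candidates have no parents, or at
# least one parent that has already been placed.
def open_list(parents, placed):
    return [i for i, ps in parents.items()
            if i not in placed and (not ps or any(p in placed for p in ps))]

# Example: two register reads feed a mul; a load feeds an add.
dfg_parents = {"read1": [], "read2": [], "mul": ["read1", "read2"],
               "ld": ["read2"], "add": ["mul", "ld"]}
print(open_list(dfg_parents, placed=set()))         # ['read1', 'read2']
print(open_list(dfg_parents, placed={"read1"}))     # ['read2', 'mul']
```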
Slide 32: Spatial Path Scheduling Example
- Calculate the placement cost for each instruction in the open list at each slot
Placement cost(i, slot) = longest path length through i if i is placed at slot
cost = inputCost + execCost + outputCost (includes communication and execution latencies; sketched below)
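A hedged sketch of this cost computation, with a Manhattan-distance routing model and simplified latencies standing in for the real operand network and latency tables (all illustrative assumptions):

```python
# Simplified placement cost = inputCost + execCost + outputCost, as above.
def manhattan_hops(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])   # 1 cycle per hop between adjacent ALUs

def placement_cost(slot, placed_parents, exec_latency, output_anchor, output_path_len):
    """Longest path length through the instruction if placed at `slot`.

    placed_parents:  list of (parent_slot, parent_ready_cycle)
    output_anchor:   location of the nearest known consumer/anchor (e.g. a register bank)
    output_path_len: remaining DFG path length below this instruction
    """
    input_cost = max((ready + manhattan_hops(pslot, slot)
                      for pslot, ready in placed_parents), default=0)
    output_cost = manhattan_hops(slot, output_anchor) + output_path_len
    return input_cost + exec_latency + output_cost

# A 3-cycle mul whose parent result is ready at cycle 15 one hop away, writing
# toward an anchor 3 hops away: roughly reproduces the 16 + 3 + 3 = 22 total of
# the following example slide.
print(placement_cost((1, 1), [((0, 1), 15)], 3, (2, 3), 0))   # 22
```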
Slide 33: Spatial Path Scheduling Example
- Calculate the placement cost for each instruction in the open list at each slot
(figure: the mul placed at ALU E1; operands arrive over the operand network at 1 cycle per hop, with path latencies of 1, 3, and 5 cycles from the register file, data cache, and control anchors)
Total placement cost = 16 + 3 + 3 = 22 (input path + execution + output path)
Slide 34: Spatial Path Scheduling Example
- Calculate the placement cost for each instruction in the open list at each slot
Placement-cost grids over the 4x4 ALU array (register file and data cache anchors at the edges of the figure), one grid per open instruction:

mul (grid 1):
24 24 22 24
22 22 22 24
24 24 24 28
26 26 26 28

add (grid 2):
10  8  8 10
10 10 10 12
12 12 12 14
14 14 14 16

mul (grid 3):
22 22 24 26
22 22 24 26
24 24 26 28
26 26 28 30

add (grid 4):
22 22 24 26
22 22 24 26
24 24 26 28
26 26 28 30
Slide 35: Spatial Path Scheduling Example
- Choose the minimum-cost location for each instruction (same cost grids as slide 34)
Slide 36: Spatial Path Scheduling Example
- Break ties; example tie-breaking heuristics: links consumed, ALU utilization
(same cost grids as slide 34)
Slide 37: Spatial Path Scheduling Example
- Place the instruction with the highest minimum cost (choose the max of the mins)
(figure: the chosen mul is placed on the ALU grid near the D0 cache port; cost grids as on slide 34)
Slide 38: Spatial Path Scheduling Algorithm
Schedule(block, topology):
  initialize known anchor points
  while not all instructions are scheduled:
    for each instruction i in the open list:
      for each available location n:
        calculate the placement cost for (i, n)
        keep track of the n with the minimum placement cost
      keep track of the i with the highest minimum placement cost
    schedule the i with the highest minimum placement cost
Per-block complexity (i = number of instructions, n = number of ALUs):
- SPS: O(i² · n)
- GRST: O(i² + i · n)
- Exhaustive search: O(i!)
A runnable sketch of this loop appears below.
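The sketch below follows the loop above under heavy simplifications (toy DFG, one instruction per slot, unit hop latency, no contention, output-path cost omitted); the data structures and the 4x4 grid are illustrative, not the TRIPS scheduler's:

```python
# Minimal SPS-style loop: place the open instruction whose best (minimum-cost)
# slot is still the most expensive ("max of the mins").
from itertools import product

GRID = list(product(range(4), range(4)))           # (row, col) ALU slots
def hops(a, b): return abs(a[0]-b[0]) + abs(a[1]-b[1])

def sps_schedule(dfg, anchors, exec_lat):
    """dfg: instr -> list of parent instrs; anchors: instr -> fixed location."""
    placement = dict(anchors)                      # anchor points are pre-placed
    ready = {i: 0 for i in anchors}                # cycle each placed result is ready
    unscheduled = [i for i in dfg if i not in placement]
    while unscheduled:
        open_list = [i for i in unscheduled
                     if not dfg[i] or any(p in placement for p in dfg[i])]
        best = {}                                  # instr -> (min cost, best slot)
        for i in open_list:
            costs = []
            for slot in GRID:
                if slot in placement.values():     # one instruction per slot (simplified)
                    continue
                inp = max((ready[p] + hops(placement[p], slot)
                           for p in dfg[i] if p in placement), default=0)
                costs.append((inp + exec_lat.get(i, 1), slot))
            best[i] = min(costs)
        # max of the mins: place the instruction whose best slot is still costly
        i = max(best, key=lambda k: best[k][0])
        cost, slot = best[i]
        placement[i], ready[i] = slot, cost
        unscheduled.remove(i)
    return placement

anchors = {"read1": (0, 0), "read2": (0, 3)}       # register-bank anchor points (toy)
dfg = {"read1": [], "read2": [], "ld": ["read2"],
       "mul": ["read1", "read2"], "add": ["mul", "ld"]}
print(sps_schedule(dfg, anchors, {"ld": 5, "mul": 3, "add": 1}))
```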
Slide 39: SPS Benefits and Limitations
- Benefits
  - Automatically exploits known communication latencies
  - Designed for spatial scheduling
  - Minimizes critical path length at each step
  - Naturally encompasses four of the five GRST heuristics
- Limitations of the basic algorithm
  - Does not account for resource contention
  - Uses no global information
  - Minimum communication latencies may be optimistic
Slide 40: Experimental Methodology
- 26 hand-optimized microbenchmarks
  - Extracted from SPEC2000, EEMBC, Livermore Loops, MediaBench, and C libraries
  - Average dynamic instructions fetched per block: 67.3 (ranging from 14.5 to 117.5)
- Cycle-accurate simulator
  - Within 4% of RTL on average
  - Models communication and contention delays
- Comparison points
  - Greedy Scheduling for TRIPS (GRST)
  - Simulated annealing
Slides 41-43: SPS Performance
Geometric mean of speedup over GRST: 1.19 (Basic SPS)
(figure: per-benchmark speedups of Basic SPS over GRST)
Slide 44: Outline (repeats slide 18; next section: Simulated Annealing)
Slide 45: How Well Can We Do?
- Simulated annealing (a generic sketch follows below)
  - Artificial-intelligence search technique
  - Uses random perturbations to avoid local optima
  - Approximates a global optimum
- Cost function: simulated cycles
  - Uncertainty makes static cost functions insufficient
  - The best available cost function
- Purpose
  - Optimization
  - Discover a performance upper bound
  - A tool to improve the scheduler
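A generic annealer over placements, sketched under the assumption that the cost callback returns the simulated cycle count of a candidate placement (the slide's cost function); the swap perturbation and cooling schedule are illustrative choices, not necessarily those used for TRIPS:

```python
# Simulated annealing over instruction placements; cost(placement) stands in
# for the simulated cycle count.
import math, random

def anneal(initial_placement, cost, steps=10_000, t0=10.0, alpha=0.999):
    current, best = dict(initial_placement), dict(initial_placement)
    c_cur = c_best = cost(current)
    t = t0
    for _ in range(steps):
        cand = dict(current)
        a, b = random.sample(list(cand), 2)        # random perturbation: swap two slots
        cand[a], cand[b] = cand[b], cand[a]
        c_cand = cost(cand)
        # accept improvements always; accept regressions with probability e^(-delta/T)
        if c_cand <= c_cur or random.random() < math.exp((c_cur - c_cand) / t):
            current, c_cur = cand, c_cand
            if c_cur < c_best:
                best, c_best = dict(current), c_cur
        t *= alpha                                  # cool the temperature
    return best, c_best
```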
Slides 46-48: Speedup with Simulated Annealing
Geometric mean of speedup over GRST: Basic SPS 1.19; Annealed 1.40
(figure: per-benchmark speedups for Basic SPS and annealed schedules)
Slide 49: Outline (repeats slide 18; next section: Extending SPS)
Slide 50: Extending SPS
- Contention
  - Network link contention
  - Local and global ALU contention
- Global register prioritization
- Path volume scheduling
Slide 51: ALU Contention
- What if two instructions are ready to execute on the same ALU at the same time?
(figure: a DFG of register reads, loads, muls, adds, a branch, and a register write mapped onto the substrate, with two instructions contending for one ALU)
Slide 52: Local vs. Global ALU Contention
- Local ALU contention (see the sketch below)
  - Keep track of expected issue times
  - Increase the placement cost if a conflict occurs
- Global ALU contention
  - Consider resource utilization in the previous/next block
  - Apply a weighting function to modify the placement cost
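A sketch of the local contention adjustment, assuming each ALU slot tracks the expected issue cycles of its already-placed instructions; the penalty value and data layout are illustrative:

```python
# Local ALU contention: a candidate placement that collides with an existing
# instruction on the same issue cycle pays a cost penalty.
from collections import defaultdict

issue_times = defaultdict(set)          # slot -> set of expected issue cycles

def contention_adjusted_cost(slot, base_cost, expected_issue, penalty=1):
    """base_cost: the SPS placement cost; expected_issue: cycle the candidate
    instruction would be ready to issue on this ALU."""
    extra = penalty if expected_issue in issue_times[slot] else 0
    return base_cost + extra

def commit_placement(slot, expected_issue):
    issue_times[slot].add(expected_issue)

# Two instructions both ready at cycle 6 on ALU (1, 1): the second pays a penalty.
commit_placement((1, 1), 6)
print(contention_adjusted_cost((1, 1), base_cost=22, expected_issue=6))  # 23
print(contention_adjusted_cost((1, 2), base_cost=22, expected_issue=6))  # 22
```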
Slides 53-55: Speedup over GRST
Geometric mean of speedup over GRST: Basic SPS 1.19; SPS extended 1.31; Annealed 1.40
(figure: per-benchmark speedups for Basic SPS, extended SPS, and annealed schedules)
Slide 56: Related Work
- Scheduling for VLIW: Ellis, Fisher
- Scheduling for other partitioned architectures
  - Partitioned VLIW: Gilbert, Kailas, Kessler, Özer, Qian, Zalamea
  - RAW: Lee
  - WaveScalar: Mercaldi
- ASIC and FPGA place and route: Paulin
  - Resource conflicts known statically
  - Substrate may not be fixed
  - Simulated annealing: Betz
Slide 57: Conclusions and Future Work
- Future work
  - Register allocation
  - Memory placement
  - Reliability-aware scheduling
- Conclusions
  - A general spatial instruction scheduling algorithm
  - Reasons explicitly about anchor points
  - Performance within 4% of annealed results
Slide 58: Questions?
Slide 59: Mapping Instructions to Physical Locations
- The scheduler converts operand format to target format, and assigns IDs
- The ID assigned to each instruction indicates a physical location
- The microarchitecture can interpret this ID in many different ways
- To schedule well, the scheduler must understand how the microarchitecture translates ID -> physical location

TIL (operand format):
  read t0, g1
  read t1, g2
  muli t2, t1, 4
  ld t3, 0(t2)
  ld t4, 4(t2)
  mul t5, t3, t4
  add t6, t5, t0
  addi t7, t1, 8
  br t7
  write g1, t6

The scheduler produces TASL (target format, locations encoded):
  R1: read, G1, N5
  R2: read, N2, N6
  N2: muli, N34, N1
  N34: ld, N32
  N1: ld, N32
  N32: mul, N5
  N5: add, W1
  N6: addi, N0
  N0: br
  W1: write, G1
Slides 60-62: Mapping Instructions to Physical Locations (continued)
(figure: the TASL code from slide 59 shown mapped onto the substrate: register read/write banks R1, R2, W1, data cache ports D0-D3, control, and the ALU grid)
Slide 63: Simulated Annealing Over Time
Slide 64: Simulated Annealing
- Cost function: simulated cycles
- Prune the search space further with a critical path tool
(figure: guided vs. unguided annealing for memset_hand)
Slide 65: Contention
- ALU contention
  - Local (within a block): estimate a temporal schedule
  - Global (between blocks): probabilistic; use a weighting function
- Network link contention
  - Precise measurements are too inaccurate
  - Estimate with a threshold and a weighting function
- Weight network link and global ALU contention based on annealed results:
  weight = (1 - fullness) * (1 - criticality / concurrency)
Slide 66: Global Register Prioritization
- Problem: with speculative execution, any register dependence may be important
- Solution: extend path lengths through registers
  - Register prioritization: schedule smaller loops before larger loops, and schedule loop-carried dependences first
  - Extend the placement cost through registers to the previous/next block
Slide 67: Path Volume Scheduling
- Problem: the basic SPS algorithm does not account for the number of instructions in the path
- Solution: perform a depth-first search with iterative deepening to find the shortest path through the substrate that can hold all of the path's instructions (a sketch follows below)
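A hedged sketch of that search on a 4x4 grid, where `volume` is the number of instructions the path must hold; the grid model and interfaces are illustrative assumptions, not the TRIPS scheduler's:

```python
# Iterative-deepening DFS: find the shortest simple route between two placed
# endpoints that visits at least `volume` distinct slots (one per instruction).
def neighbors(slot, size=4):
    r, c = slot
    return [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if 0 <= r + dr < size and 0 <= c + dc < size]

def path_of_volume(src, dst, volume):
    def dfs(slot, visited, remaining):
        if slot == dst:
            return visited if len(visited) >= volume else None
        if remaining == 0:
            return None
        for nxt in neighbors(slot):
            if nxt not in visited:
                found = dfs(nxt, visited + [nxt], remaining - 1)
                if found:
                    return found
        return None
    for depth in range(volume - 1, 16):          # deepen until a path fits
        path = dfs(src, [src], depth)
        if path:
            return path
    return None

# A 5-instruction chain between a register-bank anchor and a cache-port anchor:
print(path_of_volume((0, 0), (0, 3), volume=5))
```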