Compiling for EDGE Architectures: The TRIPS Prototype Compiler - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Compiling for EDGE Architectures: The TRIPS Prototype Compiler


1
Compiling for EDGE Architectures: The TRIPS
Prototype Compiler
Kathryn McKinley, Doug Burger, Steve Keckler, Jim
Burrill1, Xia Chen, Katie Coons, Sundeep
Kushwaha, Bert Maher, Nick Nethercote, Aaron
Smith, Bill Yoder, et al.
The University of Texas at Austin
1University of Massachusetts, Amherst
2
Technology Scaling Hitting the Wall
Qualitatively / Analytically
[figure: shrinking process generations - 130 nm, 100 nm, 70 nm, 35 nm - on a 20 mm chip edge]
Either way: partitioning for on-chip
communication is key
4
OO SuperScalars Out of Steam
  • Clock ride is over
  • Wire and pipeline limits
  • Quadratic out-of-order issue logic
  • Power, a first order constraint
  • Problems for any architectural solution
  • ILP - instruction level parallelism
  • Memory and on-chip latency
  • Major vendors ending processor lines

What's next?
5
Post-RISC Solutions
  • CMP - An evolutionary path
  • Replicate what we already have 2 to N times on a
    chip
  • Coarse grain parallelism
  • Exposes the resources to the programmer and
    compiler
  • Explicit Data Graph Execution (EDGE)
  • 1. Program graph is broken into sequence of
    blocks
  • Blocks commit atomically or not - a block never
    partially commits
  • 2. Dataflow within a block, ISA support for
    direct producer-consumer communication
  • No shared named registers (point-to-point
    dataflow edges only)
  • Memory is still a shared namespace
  • The block's dataflow graph (DFG) is explicit in
    the architecture
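A toy sketch of the target-encoding idea: each instruction names its consumers instead of writing a shared register. The class, field names, and the `write_g1` sentinel are illustrative assumptions, not the TRIPS encoding:

```python
from dataclasses import dataclass, field

# Hypothetical model of an EDGE block: instructions carry explicit
# dataflow edges (targets) rather than register names. "write_g1" is a
# stand-in for the block's register-write output.
@dataclass
class Inst:
    op: str
    targets: list = field(default_factory=list)  # consumer instruction ids

# Two adds feed a third; its result leaves the block via a write.
block = {
    0: Inst("add", targets=[2]),
    1: Inst("add", targets=[2]),
    2: Inst("add", targets=["write_g1"]),
}

def consumers(block, iid):
    """Instruction ids inside the block that receive iid's result."""
    return [t for t in block[iid].targets if isinstance(t, int)]

print(consumers(block, 0))  # [2]
```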

6
Outline
  • TRIPS Execution Model and ISA
  • TRIPS Architectural Constraints
  • Compiler Structure
  • Spatial Path Scheduling

7
Block Atomic Execution Model
TRIPS block Flow Graph
Dataflow Graph
Execution Substrate
[figure: a TRIPS block's flow graph and dataflow graph (add, ld, shl, cmp, sw, br nodes) mapped onto the execution substrate; register reads/writes go to the register file, loads/stores to the data caches]
  • TRIPS block - single entry constrained hyperblock
  • Dataflow execution w/ target position encoding

8
TRIPS Block Constraints
  • Fixed size: 128 instructions
  • Padded with no-ops if needed
  • Load/store identifiers: 32 load or store queue
    identifiers
  • More than 32 static loads and stores is possible
  • Registers: 32 reads and 32 writes, 8 to each of
    4 banks (in addition to the 128 instructions)

[figure: register banks supplying 32 reads and 32 writes; 32 loads and 32 stores to memory; a 1-128 instruction DFG; the PC read and a terminating branch]
  • Constant output: all stores and writes execute,
    one branch
  • Simplifies hardware logic for detecting block
    completion
  • Every path of execution through a block must
    produce the same stores and register writes

Simplifies the hardware, more work for the
compiler
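The constraints above amount to a small validity check on each candidate block. A minimal sketch; the constants come from this slide, but the function, field layout, and opcode spellings are illustrative assumptions, not the TRIPS toolchain's:

```python
# Hypothetical checker for the TRIPS block constraints listed above.
MAX_INSTS = 128       # blocks are padded with no-ops up to this size
MAX_LSIDS = 32        # load/store queue identifiers
MAX_REG_READS = 32    # 8 per bank x 4 banks
MAX_REG_WRITES = 32

def block_is_valid(insts, reg_reads, reg_writes):
    """insts: opcode strings; reg_reads/reg_writes: bank ids (0-3)."""
    if len(insts) > MAX_INSTS:
        return False
    if sum(op in ("ld", "sw") for op in insts) > MAX_LSIDS:
        return False                      # each load/store needs an LSID
    for bank in range(4):                 # at most 8 reads/writes per bank
        if reg_reads.count(bank) > 8 or reg_writes.count(bank) > 8:
            return False
    return len(reg_reads) <= MAX_REG_READS and len(reg_writes) <= MAX_REG_WRITES

print(block_is_valid(["ld", "add", "sw", "br"], [0, 1], [2]))  # True
```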
9
Compiler Phases (Classic)
Scale Compiler (UTexas/UMass): C and FORTRAN
frontends -> Inlining, Unrolling/Flattening ->
Scalar Optimizations -> Code Generation (Alpha,
SPARC, PPC, TRIPS TIL)

Scalar optimizations: PRE, Global Value Numbering,
Scalar Replacement, Global Variable Replacement,
SCC, Copy Propagation, Array Access Strength
Reduction, LICM, Tree Height Reduction, Useless
Copy Removal, Dead Variable Elimination

  • TIL: TRIPS Intermediate Language - RISC-like
    three-address form
  • TASL: TRIPS Assembly Language - dataflow target
    form w/ locations encoded
10
Backend Compiler Flow
Hyperblock Formation -> TIL -> Resource Allocation
-> Scheduling -> TASL

Hyperblock formation: if-conversion, loop peeling,
while-loop unrolling, instruction merging,
predicate optimizations
Resource allocation: register allocation, reverse
if-conversion and split, load/store ID assignment,
SSA for constant outputs
Scheduling: fanout insertion, instruction
placement, target form generation
11
Correctness: Progressively Satisfy Constraints

Constraints: 128 instructions; 32 load/store IDs;
32 register reads/writes (8 per each of 4 banks);
constant output

Hyperblock Formation -> TIL -> Resource Allocation
-> Scheduling -> TASL

Hyperblock formation: if-conversion, loop peeling,
while-loop unrolling, instruction merging,
predicate optimizations
Resource allocation: register allocation, reverse
if-conversion and split, load/store ID assignment,
SSA for constant outputs
Scheduling: fanout insertion, instruction
placement, target form generation
12
Predication and Hyperblock Formation
  • Predication
  • Convert control dependence to data dependence
  • Improves instruction fetch bandwidth
  • Eliminates branch mispredictions
  • Adds overhead
  • Any instruction can have a predicate, but...
  • Predicate head (low power) or bottom
    (speculative)
  • Hyperblock
  • Scheduling region (set of basic blocks)
  • Single entry, multiple exit, predicated
    instructions
  • Expose parallelism w/o oversaturating resources
  • Must satisfy block constraints

[figure: predicate applied at the head of a dependence chain (low power) vs. at the bottom (speculative)]
13
Accuracy?
Constraints: 128 instructions; 32 load/store IDs;
32 register reads/writes (8 per each of 4 banks);
constant output

Hyperblock Formation -> TIL -> Resource Allocation
-> Scheduling -> TASL
15
Spatial Scheduling Problem
Partitioned microarchitecture
16
Spatial Scheduling Problem
Partitioned microarchitecture
Anchor points
17
Spatial Scheduling Problem
Balance latency and concurrency
Partitioned microarchitecture
Anchor points
18
Outline
  • Background
  • Spatial Path Scheduling
  • Simulated Annealing
  • Extending SPS
  • Conclusions and Future Work

19
Dissecting the Problem
  • Scheduling can have two components
  • Placement: where an instruction executes
  • Issue: when an instruction executes

[figure: EDGE splits the scheduling problem into placement and issue]
20
Explicit Data Graph Execution
  • Block-atomic execution
  • Instruction groups fetch, execute, and commit
    atomically
  • Direct instruction communication
  • Explicitly encode dataflow graph by specifying
    targets

RISC (shared registers R4, R5, R6, routed through
a centralized register file):
    add r1, r4, r5
    add r2, r5, r6
    add r3, r1, r2

EDGE (direct instruction targets):
    i1: add i3
    i2: add i3
    i3: add i4
21
Scheduling for TRIPS
  • TRIPS ISA
  • Up to 128 instructions/block
  • Any instruction can be in any slot
  • TRIPS microarchitecture
  • Up to 8 blocks in flight
  • 1-cycle latency between adjacent ALUs
  • Known
  • Execution latencies
  • Lower bound for communication latency
  • Unknown (estimated)
  • Memory access latencies
  • Resource conflicts

[figure: 4x4 ALU grid with the register file along the top and the data cache along the left]
23
Greedy Scheduling for TRIPS
  • GRST [PACT '04]: based on VLIW list scheduling
  • Augmented with five heuristics
  • Prioritizes critical path (C)
  • Reprioritizes after each placement (R)
  • Accounts for data cache locality (L)
  • Accounts for register output locality (O)
  • Load balancing for local issue contention (B)
  • Drawbacks
  • Unnecessary restrictions on scheduling order
  • Inelegant and overly specific
  • Replace heuristics with elegant approach designed
    for spatial scheduling

25
Outline
  • Background
  • Spatial Path Scheduling
  • Simulated Annealing
  • Extending SPS
  • Conclusions and Future Work

26
Spatial Path Scheduling Overview
Scheduler
Dataflow Graph
Placement
Topology
27
Spatial Path Scheduling Overview
[figure: the scheduler takes the example dataflow graph (reads R1/R2, ctrl, D0, D1 anchors; add, mul, mul, ld, ld instructions) plus a topology and produces a placement]
29
Spatial Path Scheduling Overview
  • Initialize all known anchor points
  • Until all instructions are scheduled
  • Populate the open list
  • Find placement costs
  • Choose the minimum cost location
  • Schedule the instruction whose minimum placement
    cost is largest
  • (Choose the max of the mins)

30
Spatial Path Scheduling Example
  • Initialize all known anchor points

[figure: empty 4x4 ALU grid with register file and data cache anchor points]
31
Spatial Path Scheduling Example
  • Populate the open list
  • (marked in yellow)

Open list: instructions that are candidates for
scheduling. We include instructions with no
parents, or with at least one placed parent.
32
Spatial Path Scheduling Example
  • Calculate placement cost for
  • each instruction in the open
  • list at each slot

Placement cost(i, slot): longest path length
through i if placed at slot.
cost = inputCost + execCost + outputCost
(includes communication and execution latencies)
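A minimal sketch of that cost model (the function and parameter names are mine, not the TRIPS scheduler's): inputCost is the latest-arriving operand, i.e. the longest path through any placed parent plus the hops needed to reach this slot.

```python
# Illustrative version of: cost = inputCost + execCost + outputCost.
def placement_cost(parents, exec_latency, output_cost):
    """parents: list of (parent_path_length, hop_latency_to_slot)."""
    input_cost = max((path + hops for path, hops in parents), default=0)
    return input_cost + exec_latency + output_cost

# Operand values chosen so the total matches the worked example on the
# next slide: inputCost 16 + execCost 3 + outputCost 3 = 22.
print(placement_cost([(11, 5), (8, 2)], 3, 3))  # 22
```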
33
Spatial Path Scheduling Example
  • Calculate placement cost for
  • each instruction in the open
  • list at each slot

[figure: worked example - the mul placed at ALU E1, with operand paths arriving from the register file and data cache over 1-, 3-, and 5-cycle links]

Total placement cost: 16 + 3 + 3 = 22
34
Spatial Path Scheduling Example
  • Calculate placement cost for
  • each instruction in the open
  • list at each slot

Placement cost maps over the 4x4 ALU grid, one
table per open-list instruction (register file
above the grid, data cache to its left):

mul:              add:
24 24 22 24       10  8  8 10
22 22 22 24       10 10 10 12
24 24 24 28       12 12 12 14
26 26 26 28       14 14 14 16

mul:              add:
22 22 24 26       22 22 24 26
22 22 24 26       22 22 24 26
24 24 26 28       24 24 26 28
26 26 28 30       26 26 28 30
35
Spatial Path Scheduling Example
  • Choose the minimum cost
  • location for each instruction

(cost maps as on the previous slide)
36
Spatial Path Scheduling Example
  • Break ties
  • Example heuristics
  • Links consumed
  • ALU utilization

(cost maps as above)
37
Spatial Path Scheduling Example
  • Place the instruction with the
  • highest minimum cost
  • (Choose the max of the mins)

(cost maps as above; a mul, whose minimum cost of 22 is the largest of the minima, is placed on the grid next to D0)
38
Spatial Path Scheduling Algorithm
  • Schedule (block, topology)
  • initialize known anchor points
  • while (not all instructions scheduled)
  • for each instruction i in the open list
  • for each available location n
  • calculate placement cost for (i, n)
  • keep track of the n with the min placement cost
  • keep track of the i with the highest min
    placement cost
  • schedule the i with the highest min placement
    cost
  • Per-block complexity
  • SPS: O(i^2 n), i = # of instructions,
    n = # of ALUs
  • GRST: O(i^2 + i n)
  • Exhaustive search: i!
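The loop above can be sketched in a few lines. The topology, distance function, and latencies are simplified assumptions, and the output-cost term and tie-breaking heuristics from the slides are omitted; the real scheduler reasons about the TRIPS grid and its anchor points:

```python
# Sketch of the Spatial Path Scheduling loop. dfg must be a DAG.
def sps_schedule(dfg, slots, dist, exec_lat):
    """dfg: {inst: [parent insts]}; slots: locations; dist(a, b): hop
    latency between slots; exec_lat: {inst: execution cycles}."""
    placed = {}   # inst -> slot
    path = {}     # inst -> longest path length ending at inst
    while len(placed) < len(dfg):
        open_list = [i for i in dfg if i not in placed and
                     (not dfg[i] or any(p in placed for p in dfg[i]))]
        best = {}  # inst -> (min placement cost, best slot)
        for i in open_list:
            for s in (x for x in slots if x not in placed.values()):
                in_cost = max((path[p] + dist(placed[p], s)
                               for p in dfg[i] if p in placed), default=0)
                cost = in_cost + exec_lat[i]
                if i not in best or cost < best[i][0]:
                    best[i] = (cost, s)
        # "max of the mins": place the most constrained instruction
        inst = max(best, key=lambda k: best[k][0])
        path[inst], placed[inst] = best[inst]
    return placed

deps = {"a": [], "b": [], "c": ["a", "b"]}
print(sps_schedule(deps, [0, 1, 2, 3], lambda x, y: abs(x - y),
                   {"a": 1, "b": 1, "c": 1}))
```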

39
SPS Benefits and Limitations
  • Benefits
  • Automatically exploits known communication
    latencies
  • Designed for spatial scheduling
  • Minimizes critical path length at each step
  • Naturally encompasses four of five GRST
    heuristics
  • Limitations of basic algorithm
  • Does not account for resource contention
  • Uses no global information
  • Minimum communication latencies may be optimistic

40
Experimental Methodology
  • 26 hand-optimized microbenchmarks
  • Extracted from SPEC2000, EEMBC, Livermore
    Loops, MediaBench, and C libraries
  • Average dynamic instructions fetched/block:
    67.3 (ranges from 14.5 to 117.5)
  • Cycle-accurate simulator
  • Within 4% of RTL on average
  • Models communication and contention delays
  • Comparison points
  • Greedy Scheduling for TRIPS (GRST)
  • Simulated annealing

41
SPS Performance
Geometric mean of speedup over GRST: 1.19
[chart: per-benchmark speedup of basic SPS over GRST]
44
Outline
  • Background
  • Spatial Path Scheduling
  • Simulated Annealing
  • Extending SPS
  • Conclusions and Future Work

45
How well can we do?
  • Simulated annealing
  • Artificial intelligence search technique
  • Uses random perturbations to avoid local optima
  • Approximates a global optimum
  • Cost function: simulated cycles
  • Uncertainty makes static cost functions
    insufficient
  • Best cost function
  • Purpose
  • Optimization
  • Discover performance upper bound
  • Tool to improve scheduler
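A toy version of such an annealing loop, with a stand-in cost function (the slides use simulated cycles; the acceptance rule, cooling schedule, and all names here are generic textbook choices, not the authors' setup):

```python
import math
import random

def anneal(placement, cost_fn, slots, steps=10000, t0=10.0):
    """Randomly perturb a placement; accept worse schedules with a
    temperature-dependent probability to escape local optima."""
    random.seed(0)                      # deterministic for illustration
    cur, cur_cost = dict(placement), cost_fn(placement)
    best, best_cost = dict(cur), cur_cost
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9   # linear cooling
        cand = dict(cur)
        inst = random.choice(list(cand))        # move one instruction
        cand[inst] = random.choice(slots)
        c = cost_fn(cand)
        if c < cur_cost or random.random() < math.exp((cur_cost - c) / temp):
            cur, cur_cost = cand, c             # accept (possibly uphill)
            if c < best_cost:
                best, best_cost = dict(cand), c
    return best, best_cost

# Stand-in cost: total distance between dependent instructions.
dist_cost = lambda p: abs(p["a"] - p["c"]) + abs(p["b"] - p["c"])
best, best_cost = anneal({"a": 0, "b": 3, "c": 7}, dist_cost, list(range(8)))
print(best_cost <= 11)  # True: never worse than the initial placement
```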

46
Speedup with Simulated Annealing
Geometric mean of speedup over GRST: basic SPS
1.19, annealed 1.40
[chart: per-benchmark speedup of basic SPS and annealed schedules over GRST]
49
Outline
  • Background
  • Spatial Path Scheduling
  • Simulated Annealing
  • Extending SPS
  • Conclusions and Future Work

50
Extending SPS
  • Contention
  • Network link contention
  • Local and Global ALU contention
  • Global register prioritization
  • Path volume scheduling

51
ALU Contention
  • What if two instructions are ready to execute on
    the same ALU at the same time?

[figure: a block's DFG mapped onto the grid - a mul and an add assigned to the same ALU are both ready to issue in the same cycle]
52
Local vs. Global ALU Contention
  • Local ALU contention
  • Keep track of expected issue time
  • Increase placement cost if conflict occurs
  • Global ALU contention
  • Resource utilization in previous/next block
  • Weighting function
  • Modify placement cost
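The local case can be sketched as a simple cost adjustment: remember which issue cycles are already claimed on each ALU and penalize colliding placements. The penalty value and data layout are my assumptions, not the scheduler's actual model:

```python
# Sketch of the local ALU-contention extension to the placement cost.
def contention_cost(base_cost, issue_cycle, slot, issue_times, penalty=2):
    """issue_times: {slot: set of cycles already claimed on that ALU}."""
    if issue_cycle in issue_times.get(slot, set()):
        return base_cost + penalty    # both instructions ready this cycle
    return base_cost

busy = {0: {5, 6}}                    # ALU 0 issues at cycles 5 and 6
print(contention_cost(22, 5, 0, busy))  # 24: cycle 5 collides
print(contention_cost(22, 7, 0, busy))  # 22: no conflict
```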

53
Speedup over GRST
Geometric mean of speedup over GRST: basic SPS
1.19, SPS extended 1.31, annealed 1.40
[chart: per-benchmark speedup of basic SPS, extended SPS, and annealed schedules over GRST]
56
Related Work
  • Scheduling for VLIW: Ellis, Fisher
  • Scheduling for other partitioned architectures
  • Partitioned VLIW: Gilbert, Kailas, Kessler,
    Özer, Qian, Zalamea
  • RAW: Lee
  • WaveScalar: Mercaldi
  • ASIC and FPGA place and route: Paulin
  • Resource conflicts known statically
  • Substrate may not be fixed
  • Simulated annealing: Betz

57
Conclusions and Future Work
  • Future work
  • Register allocation
  • Memory placement
  • Reliability-aware scheduling
  • Conclusions
  • General spatial instruction scheduling algorithm
  • Reasons explicitly about anchor points
  • Performance within 4% of annealed results

58
Questions?
59
Mapping instructions to Physical Locations
  • Scheduler converts operand format to target
    format, and assigns IDs
  • ID assigned to each instruction indicates
    physical location
  • The microarchitecture can interpret this ID in
    many different ways
  • To schedule well, the scheduler must understand
    how the microarchitecture translates ID ->
    physical location
  • TIL (operand format) -> TASL (target format)
  • read t0, g1
  • read t1, g2
  • muli t2, t1, 4
  • ld t3, 0(t2)
  • ld t4, 4(t2)
  • mul t5, t3, t4
  • add t6, t5, t0
  • addi t7, t1, 8
  • br t7
  • write g1, t6

R1: read, G1, N5
R2: read, N2, N6
N2: muli, N34, N1
N34: ld, N32
N1: ld, N32
N32: mul, N5
N5: add, W1
N6: addi, N0
N0: br
W1: write, G1
63
Simulated Annealing Over Time
64
Simulated Annealing
  • Cost function: simulated cycles
  • Prune the space further with a critical path
    tool

[chart: guided vs. unguided annealing for memset_hand]
65
Contention
  • ALU contention
  • Local (within a block): estimate a temporal
    schedule
  • Global (between blocks): probabilistic; use a
    weighting function
  • Network link contention
  • Precise measurements too inaccurate
  • Estimate with a threshold and weighting function
  • Weight network link and global ALU contention
    based on annealed results

weight = (1 - fullness)(1 - …)  [criticality and concurrency terms]
66
Global Register Prioritization
  • Problem: any register dependence may be
    important with speculative execution
  • Solution: extend path lengths through registers
  • Register prioritization
  • Schedule smaller loops before larger loops
  • Schedule loop-carried dependences first
  • Extend placement cost through registers to
    previous/next block

67
Path Volume Scheduling
  • Problem: the basic SPS algorithm does not
    account for the number of instructions in the
    path
  • Solution: perform a depth-first search with
    iterative deepening to find the shortest path
    that holds all instructions
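A sketch of that search on an assumed 4x4 grid with 4-neighbor links (the topology and function names are illustrative): deepen the DFS limit until a simple path between the two anchor slots has room for every instruction on the DFG path.

```python
# Iterative-deepening DFS for the shortest simple path from `start` to
# `goal` that visits at least `volume` ALUs (one per instruction).
def shortest_fitting_path(start, goal, volume, width=4, height=4):
    def neighbors(slot):
        x, y = slot
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height:
                yield (nx, ny)

    def dfs(node, path, limit):
        if node == goal and len(path) >= volume:
            return list(path)
        if len(path) >= limit:
            return None           # depth limit reached; backtrack
        for n in neighbors(node):
            if n not in path:     # each ALU appears at most once
                path.append(n)
                found = dfs(n, path, limit)
                if found:
                    return found
                path.pop()
        return None

    # Deepen the limit until a path long enough to hold `volume`
    # instructions is found (or the grid is exhausted).
    for limit in range(volume, width * height + 1):
        result = dfs(start, [start], limit)
        if result:
            return result
    return None

# Adjacent anchors, but the path must hold 4 instructions: the search
# detours to pick up two extra slots.
print(len(shortest_fitting_path((0, 0), (1, 0), volume=4)))  # 4
```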