Title: Layout Driven Data Communication Optimization for High Level Synthesis
1Layout Driven Data Communication Optimization for
High Level Synthesis
Adam Kaplan, Philip Brisk and Majid Sarrafzadeh
Computer Science Department University of
California, Los Angeles
- Ryan Kastner, Wenrui Gong,
- Xin Hao, Forrest Brewer
- Dept. of Electrical and
- Computer Engineering
- University of California,
- Santa Barbara
2High Level Synthesis
- Input: application description written in C or a C-based language (C, SystemC, Handel-C, SpecC)
- Example: internal filter of an image convolver
- Goal: maximize performance (area, latency, power, ...) subject to input constraints
3Target Architectures
- Spatial architectures
- Local control between data path, global data flow between control nodes
- Lots of distributed computational units, memory
- Coarse/fine grained reconfigurable architectures
- Techniques could be used for other architectures
- May not make sense
- Our design flow has little resource sharing
Coarse grain programmable platform
Fine grain configurable platform
4Obligatory Design Flow Slide
5Design Example
int FAST(real *b, int n)
{
    real fn;
    int i, in, nn, n2pow, n4pow, nthpo;

    n2pow = fastlog2(n);
    if (n2pow <= 0) return 0;
    nthpo = n;
    fn = nthpo;
    n4pow = n2pow / 2;

    /* radix 2 iteration required; do it now */
    if (n2pow % 2) {
        nn = 2;
        in = n / nn;
        FR2TR(in, b, b + in);
    } else
        nn = 1;
- FAST function from MediaBench
- Some nodes missing: simple computation, merged into others
- Lines below show data communication
Node 1
Node 2
Node 3
Node 4
Node 5
    /* perform radix 4 iterations */
    for (i = 1; i <= n4pow; i++) {
        nn *= 4;
        in = n / nn;
        FR4TR(in, nn, b, b + in, b + 2 * in, b + 3 * in);
    }

    /* perform inplace reordering */
    FORD1(n2pow, b);
    FORD2(n2pow, b);

    /* take conjugates */
    for (i = 3; i < n; i += 2)
        b[i] = -b[i];

    return 1;
}
Node 6
Node 7
Node 8
Node 9
Node 10
6Characterizing Data Communication
- Examples of data communication schemes
[Figure: two data communication schemes for Control Nodes 2, 3 and 4]
- Distributed: data communication = wire
- Centralized: data communication = memory access, through a memory (register bank, RAM) on a bus
7Identifying Data Communication
- Determine relationship between place(s) where
data is defined and where data is used
[Figure: a single node with definitions a ←, b ←]
- Naïve method: all use points of a variable depend on all definitions of that variable
- Not all use points use a variable
[Figure: definitions a ←, b ← and a ←, c ← in separate nodes; uses ← b, ← c, ← a]
Need analysis to minimize the amount of data
communication
8Use of SSA in Compilation
- Must determine relationship between where data is generated and where data is used
- Problem formulations
- DAC'03: minimize the total number of bits communicated between all pairs of control nodes
- Today: minimize overall wirelength
- SSA (Static Single Assignment)
- Changes each variable to have a unique definition point
- Must add Φ-nodes to merge definitions
9SSA Fundamentals
- SSA algorithms
- Find location of Φ-nodes
- Rename variables
- Three main SSA algorithms
- Minimal, pruned [Cytron et al.]
- Semi-pruned [Briggs et al.]
- Differ in number and location of Φ-nodes
- Minimal: insert Φ-nodes at the iterated dominance frontier (IDF)
- Semi-pruned: insert Φ-node at the IDF if the variable is live outside some basic block
- Pruned: insert Φ-node at the IDF if the variable is live at that point
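The three flavors differ only in how they filter the iterated dominance frontier. A minimal sketch in Python, assuming the dominance frontiers are already available as a dictionary; the names `iterated_dominance_frontier`, `phi_blocks`, `live_across_blocks` and `live_in_blocks` are hypothetical and not taken from any of the cited algorithms' implementations:

```python
def iterated_dominance_frontier(df, def_blocks):
    """Blocks in the IDF of the blocks that define one variable.

    df maps each block to its dominance frontier (assumed precomputed).
    """
    idf = set()
    worklist = list(def_blocks)
    while worklist:
        block = worklist.pop()
        for frontier_block in df.get(block, ()):
            if frontier_block not in idf:
                idf.add(frontier_block)
                # A Phi-node is itself a definition, so iterate from it too.
                worklist.append(frontier_block)
    return idf


def phi_blocks(df, def_blocks, flavor, live_across_blocks=True, live_in_blocks=()):
    """Blocks that receive a Phi-node for one variable under each SSA flavor."""
    idf = iterated_dominance_frontier(df, def_blocks)
    if flavor == "minimal":
        return idf
    if flavor == "semi-pruned":
        # Keep the Phi-nodes only if the variable is live outside some block.
        return idf if live_across_blocks else set()
    if flavor == "pruned":
        # Keep a Phi-node only where the variable is live on entry.
        return idf & set(live_in_blocks)
    raise ValueError(f"unknown SSA flavor: {flavor}")


# Diamond CFG (hypothetical): entry -> {left, right} -> join -> exit
DF = {"entry": set(), "left": {"join"}, "right": {"join"},
      "join": set(), "exit": set()}
minimal = phi_blocks(DF, {"left", "right"}, "minimal")
pruned_dead = phi_blocks(DF, {"left", "right"}, "pruned", live_in_blocks=())
semi_local = phi_blocks(DF, {"left", "right"}, "semi-pruned",
                        live_across_blocks=False)
```

For a variable defined on both arms of the diamond, the minimal form places a Φ-node at `join`, while the pruned form drops it when the variable is dead there.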
10Results SSA for Data Comm. Minimization
- Edge weight w(i,j): number of bits communicated from node i to node j
- Total Edge Weight (TEW): corresponds to the amount of data communication
MediaBenchmarks
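The two metrics can be sketched as follows. This is an illustration only: the `(source node, destination node, bits)` triples are made-up values, and ignoring transfers that stay inside one control node is an assumption.

```python
from collections import defaultdict


def edge_weights(transfers):
    """w(i, j): total bits communicated from control node i to node j."""
    w = defaultdict(int)
    for src, dst, bits in transfers:
        if src != dst:  # assumption: intra-node traffic is not communication
            w[(src, dst)] += bits
    return dict(w)


def total_edge_weight(w):
    """TEW: sum of w(i, j) over all pairs of control nodes."""
    return sum(w.values())


# Hypothetical transfers: (defining node, using node, bit width)
transfers = [(1, 3, 32), (2, 3, 32), (1, 3, 8), (3, 3, 16)]
w = edge_weights(transfers)
tew = total_edge_weight(w)
```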
11Further Minimizing Data Communication
- Current SSA algorithms place Φ-nodes temporally
- In software compilation, live ranges should be short
- Appropriate in hardware?
[Figure: spatial vs. temporal Φ-node distribution. Definitions a1 ←, b1 ←, a2 ←, b2 ←, a3 ←, c1 ←; uses ← b1, ← c1, ← a4; Φ-node a4 ← Φ(a2,a3); TEW = 3]
12Spatial ?-nodes Distribution Algorithm
- d: number of uses of the Φ-node destination
- s: number of Φ-node source values
- Number of temporal links
- Number of spatial links
[Figure: a3 ← Φ(a0,a1,a2) with s = 3 sources and d = 2 uses (← a3, ← a3)]
Optimal assuming ideal n-dimensional floorplan
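The link counts themselves are not given here. One simple counting model, assumed purely for illustration: a temporal Φ-node receives one link per source and sends one link per use, while a spatial distribution copies the Φ-node to every use, so each use receives every source directly.

```python
def temporal_links(s, d):
    """Assumed model: s links into one Phi-node plus d links out of it."""
    return s + d


def spatial_links(s, d):
    """Assumed model: each of the d uses gets its own copy fed by all s sources."""
    return s * d


def cheaper_distribution(s, d):
    """Pick the distribution with fewer links under the assumed model."""
    t, sp = temporal_links(s, d), spatial_links(s, d)
    if sp < t:
        return "spatial"
    if t < sp:
        return "temporal"
    return "tie"


# The figure's example: s = 3 sources, d = 2 uses.
example = (temporal_links(3, 2), spatial_links(3, 2))
```

Under this model the choice depends only on s and d, which is consistent with an ideal floorplan in which every link has the same cost.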
13Physically Aware Compiler Transforms
- Consider layout information during compilation
- Modify transforms to consider physical info
- Ideal: full physical synthesis is extremely accurate, but way too time consuming
- Approximate using floorplanning
- Much faster
- Gives a good enough high-level physical picture
application
Hardware Compilation
- Our previous data comm. work
- No physical information
- Can lead to negative results
Physical Synthesis
14Physically Aware Data Communication
- Modify placement of Φ-functions to consider wirelength
Φ-Placement Algorithm
- Given a CFG Gcfg(Vcfg, Ecfg)
- perform_ssa(Gcfg)
- calculate_def_use_chains(Gcfg)
- remove_back_edges(Gcfg)
- topological_sort(Gcfg)
- foreach vertex v ∈ Vcfg
  - foreach Φ-node φ ∈ v
    - s ← φ.sources
    - d ← def_use_chain(φ.dest)
    - IDF ← iterated_dominance_frontier(s)
    - PossiblePlacements ← findPlacementOptions(IDF)
    - place(φ) ← selectBest(PossiblePlacements)
    - distribute/duplicate φ to place(φ)
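The selectBest step can be sketched as a wirelength comparison over candidate blocks. This is a minimal sketch, not the authors' cost function: it assumes each control node is represented by the center of its floorplan block, that wirelength is Manhattan distance, and that a single copy of the Φ-node is placed. All names and coordinates are hypothetical.

```python
def manhattan(p, q):
    """Manhattan distance between two (x, y) block centers."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def phi_wirelength(candidate, sources, uses, centers):
    """Wire from every source to the Phi-node, then from it to every use."""
    c = centers[candidate]
    into_phi = sum(manhattan(centers[s], c) for s in sources)
    out_of_phi = sum(manhattan(c, centers[u]) for u in uses)
    return into_phi + out_of_phi


def select_best(candidates, sources, uses, centers):
    """Candidate block with the smallest estimated wirelength."""
    return min(candidates, key=lambda b: phi_wirelength(b, sources, uses, centers))


sources, uses, candidates = ["B2", "B3"], ["B5"], ["B4", "B5"]

# Two hypothetical floorplans for the same control flow graph.
plan_a = {"B2": (0, 0), "B3": (4, 0), "B4": (2, 2), "B5": (2, 6)}
plan_b = {"B2": (0, 0), "B3": (4, 0), "B4": (10, 0), "B5": (2, 1)}

best_a = select_best(candidates, sources, uses, plan_a)
best_b = select_best(candidates, sources, uses, plan_b)
```

The same Φ-node lands in the merge block B4 under the first floorplan and in the use block B5 under the second, which illustrates why the placement has to consult the floorplan.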
15Algorithm in Action
- Evaluate all options for Φ-nodes
- Replicate Φ when necessary
- Limit amount of replication: most often leads to more wirelength
- Can play tricks to limit redundant placements
[Figure: definitions a1 ←, b1 ←, a2 ←, b2 ←, a3 ←, c1 ←; uses ← b1, ← c1, ← a4. Candidate locations for a4 ← Φ(a2,a3) range from the traditional (temporal) placement to the spatial placement of DAC'03.]
Any of these options could yield the best wirelength. Highly dependent on the floorplan.
16Algorithm in Action
- FAST function from MediaBench test suite
17Algorithm in Action
18Full Floorplanning Results
- Simple iterative approach
- Initial optimization minimizes data communication
- Full SA (simulated annealing) based floorplanning
- Reoptimization based on the floorplan to minimize wirelength
- Full SA based floorplanning
Spectacularly negative results
19Incremental Floorplanning
- Incremental placement [Coudert et al.]
- Given an optimized placement and a set of changes to the netlist (e.g., due to technology remapping), modify the placement to improve it.
- Equally applicable to floorplanning
[Figure: initial floorplan, perturbations, modified floorplan]
20Our Incremental Floorplanner
[Figure: initial floorplan, perturbations, modified floorplan, and the incremental floorplan produced by the incremental floorplanner. Slicing tree over modules 1, 2, 3, 4 with node labels 32/36, 27/30.4, 5/5.6, 16/18, 11/12.4, 2/2.3, 9/10.1]
21Our Incremental Floorplanner
- Calculate area room of each node: bottom-up slicing tree traversal
- Area redistribution
- Top-down traversal
- Increase area if necessary
- Not enough space at root
- Aspect ratios become too distorted
Simple, yet effective. Other, more complicated algorithms might work better.
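The two traversals can be sketched on a slicing tree. This is a simplified illustration, not the actual floorplanner: it assumes the top-down pass hands each child a share of its parent's area proportional to the child's own area, a rule that is consistent with figure labels such as 32/36 and 5/5.6 but is not stated here. Aspect ratios are ignored, and `SliceNode` and the module names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SliceNode:
    name: str
    area: float = 0.0                    # leaf: module area; internal: filled in
    left: Optional["SliceNode"] = None
    right: Optional["SliceNode"] = None
    budget: float = 0.0                  # area granted by the top-down pass

    @property
    def is_leaf(self):
        return self.left is None and self.right is None


def sum_areas(node):
    """Bottom-up traversal: an internal node needs the sum of its children."""
    if not node.is_leaf:
        node.area = sum_areas(node.left) + sum_areas(node.right)
    return node.area


def redistribute(node, budget):
    """Top-down traversal: split the budget in proportion to child areas."""
    node.budget = budget
    if node.is_leaf:
        return
    for child in (node.left, node.right):
        redistribute(child, budget * child.area / node.area)


def incremental_floorplan(root, available_area):
    """Grow the root if there is not enough space, then redistribute."""
    needed = sum_areas(root)
    budget = max(available_area, needed)
    redistribute(root, budget)
    return budget


# Leaf areas 5, 16, 2 and 9, matching the left-hand values in the figure labels.
m_a, m_b, m_c, m_d = (SliceNode(n, a) for n, a in
                      [("m_a", 5), ("m_b", 16), ("m_c", 2), ("m_d", 9)])
n11 = SliceNode("n11", left=m_c, right=m_d)
n27 = SliceNode("n27", left=m_b, right=n11)
root = SliceNode("n32", left=n27, right=m_a)
granted = incremental_floorplan(root, 36)
```

With a root budget of 36 the leaves receive 5.625, 18, 2.25 and 10.125, close to the rounded right-hand values in the figure labels.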
[Figure: modified floorplan and incremental floorplan, with the same slicing tree labels as on the previous slide]
22 MediaBench Functions
#   Benchmark        Blocks  Φ-nodes  Links  Weight  Initial WL
1   adpcm coder          33       31     54    2688       35568
2   adpcm decoder        26       23     44    1952       21588
3   internal filter      10      143     60   17088      411637
4   internal expand     101       94    257   14336      317031
5   compress output      34       17     60    2368       29114
6   mpeg2dec block       62       13     66    2272       34510
7   mpeg2dec vector      16        4     26    1024        4366
8   FAST                 14        4     15     704        3714
9   FR4TR                77       87    155     704      340697
10  det                  12        5     13    7936        3772
23Incremental Floorplanning Results
- Optimal approach: 12% overall wirelength reduction, 25% Φ-node wirelength reduction
- Our approach: 6% overall wirelength reduction, 8% Φ-node wirelength reduction
[Chart: normalized wirelength per benchmark, with average]
24Related Work
- Hardware compilation projects using SSA
- PDGSSA form (UCSB)
- CASH (CMU)
- SA-C (UCR)
- Sea Cucumber (BYU)
- Physically aware behavioral synthesis techniques
- SA for scheduling, binding and floorplanning [Prabhakaran97]
- SA for binding and floorplanning [Yung-Ming94]
- Scheduling, allocation and binding [Dougherty00]
- Fasolt bus topology [Knapp92]
- High level synthesis [Tarafdar00]
- Incremental CAD
- Problem overview/challenges [Coudert00]
- Floorplanning [Crenshaw99]
25Conclusions
- It's been a long, strange trip
- SSA is a nice IR for hardware compilation
- Explicitly shows data flow
- Useful for exploiting parallelism
- Compiler techniques applied to hardware design can reduce wirelength
- They must be aware of physical information
- They must use incremental floorplanning
26Questions?