Title: HW/SW Co-Design of Embedded Reconfigurable Architectures


1
HW/SW Co-Design of Embedded Reconfigurable Architectures
  • Yanbing Li, Tim Callahan, Ervan Darnell, Randolph Harr, Uday Kurkure, Jon Stockwood
  • Synopsys Inc.
  • EECS, Univ. of California, Berkeley
  • Silicon Spice

2
Outline
  • Nimble compiler overview
  • HW/SW partitioning algorithm
  • Results
  • Conclusions

4
Nimble Project
Retargetable compiler for Agileware
Configurable systems using intelligent tools
  • Off-the-shelf ANSI C to Agileware
  • Automatic, quick HW/SW partitioning and compilation
  • Embedded DSP target market
  • Goal: show that Agileware provides a better compilation model and better performance than existing DSP / VLIW architectures
  • Agileware
    • CPU + reconfigurable datapath
    • Memory bus interface
    • Quick configuration
    • Examples: GARP, ACE

[Figure: embedded CPU (SW), reconfigurable datapath (HW), SRAM / cache]
7
Nimble Compiler Overview
[Figure: Nimble compilation flow]
  • Inputs: C code, Agileware Description Language
  • Front-end compiler: profiling, optimizations, automatic HW/SW partitioning
  • HW kernels as DFGs
  • ADAPT datapath synthesis: scheduling, mapping, floorplanning
  • FLAME generator libraries
  • Automatic P&R with vendor tools
  • Config bitstream
  • C code to run on the CPU
  • Embedded GCC
  • Output: mixed HW/SW executable
9
HW/SW Partitioning: the Nimble Approach
  • Spatial vs. temporal partition

[Figure: spatial partition vs. temporal partition of kernels k1 and k2 on the reconfigurable HW]
  • Loops as HW candidates
  • Focus on a small number of dominating loops
  • MPEG2 encoder: 180 loops in total; the top 20 contribute >90% of the time
  • Instruction-level parallelism
  • Compiler transformations generate multiple versions
  • Loop transformations, pipelining, etc.

10
Related Work
  • HW/SW partitioning: spatial partitioning
    • Single CPU + ASIC
      • COSYMA 97, Kalavade et al. 94, Wolf 94
    • Heterogeneous architectures
      • SOS 92, Jha et al. 97, Li & Wolf 98
  • HW/SW partitioning and compilation for reconfigurable architectures
    • Callahan 98, Kaul et al. 99, Luk et al.

12
Problem Specification
  • Partition the application onto the CPU and the configurable datapath (DP)
  • Under an area constraint
  • Goal: maximize overall application performance
  • Assumptions
    • One HW kernel (loop) per configuration
    • HW/SW serial execution

[Figure: example partitioning result]
Loop    SW version   HW versions            Partitioning result
Loop1   sw1          hw1.1, hw1.2, hw1.3    sw1 (SW)
Loop2   sw2          hw2.1                  hw2.1 (HW)
Loop3   sw3          hw3.1, hw3.2           hw3.2 (HW)
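14
Problem Formulation
[Equation with terms: HW and SW time, HW/SW interface time, config time]
  • Need global optimization
  • A loop's config time depends on the HW/SW partitions of other loops

for (i = 1; i <= 100; i++) {
    /* loop A */ for (...) { ... }
    /* loop B */ for (...) { ... }
}
[Figure: loops A and B, with the label 100]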
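
The dependence between loops can be illustrated with a small sketch. It assumes, beyond what the slide states, a datapath that holds exactly one configuration at a time and an entry trace in which loops A and B alternate 100 times. The helper name `count_configs` is hypothetical and not part of the Nimble flow.

```python
def count_configs(trace, hw_loops):
    """Count reconfigurations needed for a loop entry trace.

    Assumption: the datapath holds one configuration at a time, so a HW
    loop must be loaded whenever a different HW loop was loaded last.
    Loops left in SW never touch the datapath.
    """
    loaded = None
    configs = 0
    for loop in trace:
        if loop in hw_loops and loop != loaded:
            configs += 1
            loaded = loop
    return configs


# The outer loop body enters A, then B, 100 times.
trace = ["A", "B"] * 100

print(count_configs(trace, {"A"}))       # A alone in HW: 1 configuration
print(count_configs(trace, {"A", "B"}))  # A and B in HW: 200 configurations
```

Under these assumptions, moving B to HW changes A's configuration count from 1 to 100, which is why the selection cannot be made one loop at a time.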
16
Overview: Our Partitioning Algorithm
  • Preprocessing
    • Interesting loop selection (>1%)
    • SW performance profiling: SW times
    • Loop entry trace profiling: loop entry trace for config times
    • Compiler transformations
    • Quick HW synthesis: HW times / areas
  • HW/SW partitioning
    • Local optimization: for each loop, intra-loop selection
    • Global optimization: for all loops, inter-loop selection
21
Nimble HW/SW Partitioning Process
[Figure: partitioning pipeline for example loops Loop1 to Loop6, showing the SW and HW candidate versions (sw1, hw1.1 to hw1.3, sw2, hw2.1, sw3, hw3.1, hw3.2) that remain after each stage]
  • All loops (99)
  • ILD: top 90% loops
  • Transform: heuristics to create multiple kernel versions
  • Intra-loop selection: per-loop selection
  • Inter-loop selection: application-level selection
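
One plausible reading of the first stage is a cumulative-coverage filter over the SW profile. The sketch below is an assumption, not the presentation's stated criterion: it keeps the most time-consuming loops until they cover a chosen fraction of total loop time. The function name and the profile numbers are hypothetical.

```python
def select_dominating_loops(loop_times, coverage=0.90):
    """Return the heaviest loops whose combined time reaches `coverage`.

    loop_times maps a loop name to its profiled SW time. Loops are taken
    in descending order of time until the covered share is reached.
    """
    total = sum(loop_times.values())
    selected = []
    covered = 0.0
    ranked = sorted(loop_times.items(), key=lambda kv: kv[1], reverse=True)
    for name, time in ranked:
        if covered >= coverage * total:
            break
        selected.append(name)
        covered += time
    return selected


# Hypothetical profile, in arbitrary time units.
profile = {"Loop1": 50, "Loop2": 30, "Loop3": 12,
           "Loop4": 5, "Loop5": 2, "Loop6": 1}
print(select_dominating_loops(profile))
```

With this made-up profile, three of the six loops are enough to cover 90% of the time, so only those would be passed on to the transformation and selection stages.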
25
Intra-Loop Selection
[Figure: design space for a loop; delay vs. HW area, with points sw and hw1 to hw6 and the available area marked]
  • Quick synthesis of multiple versions of each loop
  • Select the fastest HW loop version within the area constraint

26
Intra-Loop Selection
[Figure: the same design space after selection; only sw and hw6 remain]
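
The selection rule above can be sketched in a few lines. The area and delay numbers are invented for illustration (the slide's figure values are not recoverable), and `select_fastest_within_area` is a hypothetical name.

```python
def select_fastest_within_area(hw_versions, area_available):
    """Pick the lowest-delay HW version that fits in the available area.

    Returns None when no HW version fits, in which case the loop can
    only run in SW.
    """
    feasible = [v for v in hw_versions if v["area"] <= area_available]
    if not feasible:
        return None
    return min(feasible, key=lambda v: v["delay"])


# Hypothetical design points for one loop (arbitrary units).
versions = [
    {"name": "hw1", "area": 10, "delay": 90},
    {"name": "hw2", "area": 20, "delay": 70},
    {"name": "hw3", "area": 25, "delay": 75},
    {"name": "hw4", "area": 35, "delay": 60},
    {"name": "hw5", "area": 60, "delay": 30},  # fastest, but too large
    {"name": "hw6", "area": 45, "delay": 40},
]

best = select_fastest_within_area(versions, area_available=50)
print(best["name"])
```

The SW version is not part of this choice; it stays available as the alternative that inter-loop selection weighs against the chosen HW version.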

27
Inter-Loop Selection
  • Select which loops execute in HW and which in SW
  • Approach
    • Divide the loops into small clusters and perform optimal selection within each loop cluster
    • Loop clustering is based on the loop-procedure hierarchy graph

28
Inter-Loop Selection
Loop-procedure hierarchy graph
[Figure: hierarchy graph for the Wavelet benchmark. Node labels: Main, W, I, R, FW, Q, RLE, E, BQ; I1; R1 to R4; FW1 to FW7; Q1 to Q6; RLE1 to RLE3; E1 to E4]
29
Inter-Loop Selection
Clustering of dominating loops based on hierarchy graph
[Figure: hierarchy graph with node labels Main, W, I, R, FW, Q, RLE, E, BQ; I1; R1 to R4; FW1 to FW7; Q1 to Q6; RLE1 to RLE3; E1 to E4]
36
Optimal Selection in Loop Cluster
  • Inter-loop selection inside each loop cluster
  • Compute the configuration cost of all loops for each partitioning possibility
  • The configuration cache is taken into account
  • Example
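
A minimal sketch of exhaustive selection inside one cluster follows. It is an illustration under simplifying assumptions, not the presentation's algorithm: the datapath holds one configuration at a time, every reconfiguration costs the same `config_time`, and the configuration cache, HW/SW interface time, and area are ignored. All names and numbers are hypothetical.

```python
from itertools import product


def count_configs(trace, hw_loops):
    """Reconfigurations for a trace, with one configuration resident."""
    loaded, configs = None, 0
    for loop in trace:
        if loop in hw_loops and loop != loaded:
            configs += 1
            loaded = loop
    return configs


def best_partition(loops, trace, config_time):
    """Try every HW/SW assignment of the cluster and keep the fastest.

    loops maps a loop name to (total SW time, total HW time). The search
    visits 2**len(loops) assignments, so it suits small clusters only.
    """
    names = sorted(loops)
    best = None
    for choice in product((False, True), repeat=len(names)):
        hw = {n for n, in_hw in zip(names, choice) if in_hw}
        time = sum(loops[n][1] if n in hw else loops[n][0] for n in names)
        time += config_time * count_configs(trace, hw)
        if best is None or time < best[0]:
            best = (time, hw)
    return best


# Hypothetical cluster: A and B alternate 100 times.
loops = {"A": (1000, 400), "B": (800, 500)}
trace = ["A", "B"] * 100
print(best_partition(loops, trace, config_time=5))  # (1205, {'A'})
```

With these numbers both loops are faster in HW when considered alone, yet putting both in HW costs 200 reconfigurations and ends up slower than the all-SW run, so the search keeps only A in HW.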

38
Results
[Chart: normalized application execution time (most optimistic assumptions made)]
  • Our partitioning algorithm finds optimal or close-to-optimal results for the benchmarks tested

39
Algorithm Execution Performance
[Chart: algorithm CPU time (sec)]
  • The KS algorithm is fast (<2 s for the benchmarks tested) and is not the bottleneck of the Nimble flow

40
Conclusions
  • Implemented HW/SW partitioning in the context of the Nimble flow
  • Fully automatic flow
  • Our algorithm is efficient and effective in finding close-to-optimal HW/SW partitions
  • Global time optimization
    • SW time, HW time, config time, HW/SW interface time
  • SW time profiling and quick HW synthesis are essential for evaluating HW/SW tradeoffs
  • More compiler transformations and heuristics to drive the transformations

41
Algorithm Optimality
  • Overall KS algorithm optimality
    • Depends on the accuracy of HW and SW time estimation
    • Unimportant loops are eliminated
    • Depends on the level of loop clustering
    • 1st level guarantees optimality
  • Complexity
    • Select dominating loops: O(l)
    • Loop entry trace profiling: O(P)
    • Partitioning: O(kn) + O(n^2) + O(2^C)

l: total loops; P: application execution time; k: versions per loop; n: dominating loops; C: maximum cluster size (constant)
42
Loop Trace Profiling
  • Record the entry trace of all HW-feasible loops
    • Lossless information for computing configuration time
  • Online trace compression
    • Reduces the storage size of the trace
    • Analysis is based on the compressed trace
  • Example: MPEG2 encoder
    • 200 MB -> 2 KB

Example: Wavelet
  Original trace: EDEDED... CBCBCB... EDED... CBCB... EDED... CBCB...
  Compressed trace: (ED)^64 (CB)^64 (ED)^32 (CB)^32 (ED)^16 (CB)^16
[Figure: nodes E, D, C, B; label 3]
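
The Wavelet example can be reproduced with a simple repeated-pattern encoder. This is an offline sketch written for illustration; it is not the online compressor the slide refers to, whose details are not given. `compress_trace`, `expand_trace`, and `max_period` are hypothetical names.

```python
def compress_trace(trace, max_period=4):
    """Encode a trace as (pattern, repeat count) pairs.

    At each position, try pattern lengths 1..max_period and keep the
    one whose consecutive repeats cover the most symbols. Ties go to
    the shorter pattern.
    """
    out = []
    i = 0
    while i < len(trace):
        best_p, best_count = 1, 1
        for p in range(1, max_period + 1):
            pattern = trace[i:i + p]
            if len(pattern) < p:
                break
            count = 1
            while trace[i + count * p:i + (count + 1) * p] == pattern:
                count += 1
            if count > 1 and count * p > best_count * best_p:
                best_p, best_count = p, count
        out.append((trace[i:i + best_p], best_count))
        i += best_p * best_count
    return out


def expand_trace(compressed):
    """Rebuild the original trace, showing the encoding is lossless."""
    return "".join(pattern * count for pattern, count in compressed)


trace = "ED" * 64 + "CB" * 64 + "ED" * 32 + "CB" * 32 + "ED" * 16 + "CB" * 16
compressed = compress_trace(trace)
print(compressed)
print(expand_trace(compressed) == trace)
```

On this input the encoder yields the six (pattern, count) pairs shown in the compressed trace above, and expanding them restores the original 448-symbol trace.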