xPilot: A Platform-Based System-Level Synthesis for Reconfigurable SOCs

Provided by: cadlab
1
xPilot: A Platform-Based System-Level Synthesis
for Reconfigurable SOCs
  • Prof. Jason Cong
  • cong@cs.ucla.edu
  • UCLA Computer Science Department

2
Motivation
  • Design complexity is outgrowing the traditional
    RTL method, even in current CMOS technologies
  • Nanotechnology will enable a 10-100x increase in
    device density and degree of integration
  • Need to enable a higher level of design abstraction
      • Start from behavioral descriptions (e.g., C or
        SystemC)
      • Use and/or reuse more complex functional units
        (e.g., processor cores instead of standard cells)

3
ESL Tools: A Lot of Interest
4
xPilot: Platform-Based Synthesis System
[Figure: SystemC/C and platform description/constraints → xPilot front end → SSDM (System-Level Synthesis Data Model) with profiling, analysis, and mapping → processor architecture synthesis, behavioral synthesis, and interface synthesis → custom logic, drivers/glue logic, and processor cores/executables on an FPSoC]
  • Uniqueness of xPilot
      • Platform-based synthesis and optimization
      • Communication-centric synthesis with interconnect
        optimization

5
xPilot Behavioral-to-RTL Synthesis Flow
  • Presynthesis optimizations
      • Loop unrolling/shifting
      • Strength reduction / tree height reduction
      • Bitwidth analysis
      • Memory analysis
  • Core synthesis optimizations
      • Scheduling
      • Resource binding, e.g., functional unit binding,
        register/port binding
  • µArch generation and RTL/constraints generation
      • Verilog/VHDL/SystemC
      • FPGAs: Altera, Xilinx
      • ASICs: Magma, Synopsys
[Figure: behavioral spec. in C/SystemC and platform description → front-end compiler → SSDM → RTL and constraints → FPGAs/ASICs]
6
System-Level Exploration Using xPilot for
Heterogeneous MPSoC Platforms
  • Heterogeneous MPSoC exploration
      • Processors
          • Heterogeneous vs. homogeneous
          • General-purpose vs. application-specific
      • On-chip communication architecture (OCA)
          • Bus (e.g., AMBA, CoreConnect), packet-switching
            network (e.g., Alpha 21364)
      • Memory hierarchy
[Figure: heterogeneous MPSoC in which µP, IP, FPGA, and DSP cores run tasks on OS drivers and attach through network interfaces to a shared communication network]
7
Outline
  • xPilot Overview
  • Behavior-level synthesis in xPilot
  • System-level synthesis in xPilot
  • Recent Progress in xPilot
      • Interface synthesis
      • Resource binding based on distributed register
        architecture
  • Conclusions

8
Advantages of Behavioral Synthesis
  • Shorter verification/simulation cycle
  • Better complexity management, faster time to
    market
  • Rapid system exploration
      • Quick evaluation of different hardware/software
        boundaries
      • Fast exploration of multiple micro-architecture
        alternatives
  • Higher quality of results
      • Platform-based synthesis optimization
      • Full consideration of physical reality

9
Example: Better Complexity Management
  • Shorter verification/simulation cycle
      • Simulation speed 100X faster than the RTL-based
        method [NEC, ASPDAC04]
  • Significant code size reduction
      • RTL design of 300K lines → behavioral design of
        40K lines [NEC, ASPDAC04]
      • VHDL code generated by UCLA xPilot targeting the
        Altera Stratix platform
      • Over 10x code size reduction can be achieved

10
Unique Features of xPilot (1): Platform-Based
Synthesis Optimization
  • Platform-based synthesis optimization
      • The quality of an RTL design is platform-dependent
      • Designers often lack complete and detailed
        knowledge of the target platform

Resource          Area            Delay (ns)
ADDSUB-24b        25 LUTs         2.27
ADDSUB-32b        33 LUTs         2.61
MUX8to1-24b       120 LUTs        2.92
MUX16to1-24b      264 LUTs        4.658
DSPMUL-18bx18b    2 DSP blocks    3.833
DSPMUL-24bx24b    8 DSP blocks    7.688

3x3 delay matrix (chip grid from (0,0) to (95,61)):
  0.58   1.8   2.8
  2.0    2.9   3.7
  2.8    3.8   4.7

  • Platform: Altera Stratix
  • RTL synthesis and place-and-route: Altera Quartus II
    v5.0
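A small Python sketch can show how characterized platform data feeds a synthesis decision such as operation chaining. The component delays are copied from the Stratix table above; how to read the 3x3 delay matrix is an assumption made only for this illustration (`WIRE_NS[dx][dy]` is taken as the wire delay, in ns, between units `dx` and `dy` chip regions apart), delays are assumed to add linearly, and `chain_delay`, `fits`, and the 8 ns clock are hypothetical.

```python
from typing import List, Sequence, Tuple

# Characterized Altera Stratix component delays (ns), copied from the table.
DELAY_NS = {
    "ADDSUB-24b": 2.27, "ADDSUB-32b": 2.61,
    "MUX8to1-24b": 2.92, "MUX16to1-24b": 4.658,
    "DSPMUL-18bx18b": 3.833, "DSPMUL-24bx24b": 7.688,
}

# The slide's 3x3 delay matrix. ASSUMPTION (hypothetical reading): WIRE_NS[dx][dy] is
# the wire delay in ns between units that are dx and dy chip regions apart.
WIRE_NS: List[List[float]] = [
    [0.58, 1.8, 2.8],
    [2.0, 2.9, 3.7],
    [2.8, 3.8, 4.7],
]

def chain_delay(ops: Sequence[str], hops: Sequence[Tuple[int, int]]) -> float:
    """Combinational delay of ops chained in one cycle; hops[k] is the (dx, dy)
    region distance between ops[k] and ops[k + 1]. Delays are assumed to add linearly."""
    if len(hops) != len(ops) - 1:
        raise ValueError("need exactly one hop between each pair of consecutive ops")
    return sum(DELAY_NS[op] for op in ops) + sum(WIRE_NS[dx][dy] for dx, dy in hops)

def fits(ops: Sequence[str], hops: Sequence[Tuple[int, int]], clock_ns: float) -> bool:
    """True if the chained operations complete within one clock period."""
    return chain_delay(ops, hops) <= clock_ns

chain = ["ADDSUB-24b", "MUX8to1-24b"]
print(round(chain_delay(chain, [(0, 0)]), 2), fits(chain, [(0, 0)], 8.0))  # same region
print(round(chain_delay(chain, [(1, 1)]), 2), fits(chain, [(1, 1)], 8.0))  # one region apart each way
```

Under these assumptions the same ADDSUB-24b to MUX8to1-24b chain needs about 5.77 ns when both units share a region but about 8.09 ns when they are one region apart in each direction, so only the first placement fits an 8 ns cycle.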

11
Unique Features of xPilot (2): Communication-Centric
Synthesis Optimization
  • System performance and power are dominated by
    interconnect
  • It is difficult for designers to consider
    physical layout at the RT level
[Figure: a CDFG with a comparison selecting a true (T) or false (F) branch, operations 2-6, and additions add1 and add2, bound two ways.
  Binding solution 1: mul1 (2,4,5), mul2 (3,6); both multipliers stay active. Layout-aware performance optimization: overlap computation with communication (data transfer).
  Binding solution 2: mul1 (2,5,6), mul2 (3,4); mul2 can be powered off when the false branch is taken. Layout-aware power optimization.]
12
Unique Features of xPilot (3): Highly Scalable
and Optimized Synthesis Algorithms
  • Use of highly scalable and optimized synthesis
    algorithms for the best quality of results
      • Interface synthesis: simultaneous data and
        communication scheduling for latency minimization
      • Scheduling: a unified framework for
        multi-constraint and multi-objective scheduling
        based on a system of difference constraints
        (SDC)
      • Resource binding: use of distributed register
        architectures for interconnect/communication
        optimization
      • Power optimization: optimal functional module
        and voltage binding
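The SDC idea mentioned above expresses scheduling constraints as differences of start times, s[v] - s[u] <= c, which map onto edges of a constraint graph. The published SDC scheduler optimizes a linear objective over such constraints with an LP solver; the minimal Python sketch below shows only the feasibility core, finding one feasible schedule with Bellman-Ford shortest paths. The helper names `dependency`, `max_latency`, and `solve_sdc` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (u, v, c) encodes the difference constraint  s[v] - s[u] <= c.
Constraint = Tuple[str, str, int]

def dependency(u: str, v: str, latency: int) -> Constraint:
    """v starts at least `latency` cycles after u: s[u] - s[v] <= -latency."""
    return (v, u, -latency)

def max_latency(u: str, v: str, bound: int) -> Constraint:
    """v starts at most `bound` cycles after u: s[v] - s[u] <= bound."""
    return (u, v, bound)

def solve_sdc(ops: List[str], cons: List[Constraint]) -> Optional[Dict[str, int]]:
    """Feasible integer start times for a system of difference constraints, or None.

    Each constraint is an edge u -> v of weight c. Shortest-path distances from a
    virtual source (0-weight edge to every op) satisfy dist[v] <= dist[u] + c for
    every edge, which is exactly the constraint. A negative cycle means infeasible.
    """
    dist = {op: 0 for op in ops}            # distances after relaxing the virtual-source edges
    for _ in range(len(ops)):               # enough rounds to cover every simple path
        for u, v, c in cons:
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
    if any(dist[u] + c < dist[v] for u, v, c in cons):
        return None                         # still relaxing: negative cycle, conflicting constraints
    base = min(dist.values())
    return {op: d - base for op, d in dist.items()}  # shift so the earliest op starts at 0

OPS = ["a", "b", "c", "d"]
CONS = [dependency("a", "b", 1), dependency("b", "c", 1), dependency("a", "d", 2)]
print(solve_sdc(OPS, CONS))
print(solve_sdc(OPS, CONS + [max_latency("a", "c", 1)]))
```

With the chain a → b → c and a two-cycle dependency a → d, the sketch returns start cycles 0, 1, 2, 2 for a, b, c, d; adding the latency bound s[c] - s[a] <= 1 creates a negative cycle, and the solver reports the system as infeasible.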

13
Behavior and Communication Co-Optimization for
Systems with SCM
  • SCM: Sequential Communication Media
      • FIFOs (e.g., Xilinx FSLs), buses (e.g., Xilinx
        CoreConnect, Altera Avalon)
      • Data must be read and written in the same order
  • The order may have a dramatic impact on performance
      • The best order should guarantee that no data
        transmission on the critical path is delayed by
        a non-critical transmission
[Figure: DCT example in C. Processes P1 and P2 (custom logic 1 and 2 on PE1 and PE2) exchange data[8] through a FIFO; code fragments:
    for (int i = 0; i < 8; i++) S1: data[i]
    int s07 = data[0] + data[7]; int s16 = data[1] + data[6]; ...]
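The effect of transmission order can be seen in a toy timing model of the DCT example. Everything in this Python sketch is a hypothetical simplification: the FIFO delivers one word per cycle, and the consumer has a single non-pipelined unit that needs two cycles per butterfly (s07, s16, s25, s34) and starts one as soon as both of its inputs have arrived. `completion_time` and `PAIRS` are illustrative names.

```python
from typing import Dict, List, Sequence, Tuple

# DCT first-stage butterflies: s07 needs data[0] and data[7], s16 needs data[1] and data[6], ...
PAIRS: List[Tuple[int, int]] = [(0, 7), (1, 6), (2, 5), (3, 4)]

def completion_time(order: Sequence[int], pairs: List[Tuple[int, int]], op_cycles: int) -> int:
    """Cycle at which the consumer finishes all butterflies.

    Assumed model: the FIFO delivers one word per cycle in `order` (the k-th word is
    available at cycle k + 1); one non-pipelined unit spends `op_cycles` per butterfly.
    """
    arrival: Dict[int, int] = {idx: k + 1 for k, idx in enumerate(order)}
    ready = sorted(max(arrival[a], arrival[b]) for a, b in pairs)
    unit_free = 0
    for r in ready:                      # serve butterflies in the order their operands arrive
        start = max(r, unit_free)
        unit_free = start + op_cycles
    return unit_free

in_order = list(range(8))                # data[0], data[1], ..., data[7]
paired = [0, 7, 1, 6, 2, 5, 3, 4]        # reordered to match the consumer's access pattern
print(completion_time(in_order, PAIRS, 2), completion_time(paired, PAIRS, 2))
```

Under this model the in-order stream finishes at cycle 13 and the paired order at cycle 10: in index order no butterfly has both operands until cycle 5, after which all four queue up on the single unit.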
14
SCM Co-Optimization: Problem Formulation
  • Given
      • A set of processes P connected by a set of
        channels C
      • A set of data $D = \{d_1, d_2, \ldots, d_m\}$ to be
        transmitted on each channel $c_j$
  • Goal
      • Find the optimal transmission order for each
        process, so that the overall latency of the
        process network is minimized subject to the given
        design constraints and platform specifications
      • In the meantime, generate the drivers and glue
        logic for each process automatically

15
Proposed SCM Co-Optimization Design Flow
[Figure: the process network and platform description/constraints enter the front end, which builds the System-Level Synthesis Data Model; SCOOP (SCM CO-Optimization) performs communication order detection, code transformation and interface generation, and indices compression for loop reordering, producing the process behavior plus drivers and glue logic]
16
Communication Order Detection
  • Step 1. Construct a global CDFG by merging the
    individual CDFGs of each process
  • Step 2. Solve a resource-constrained min-latency
    scheduling problem to optimize the total latency
    of the global CDFG
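Step 2 can be illustrated with list scheduling, swapped in here as a simple heuristic for resource-constrained min-latency scheduling (the slide does not say which solver SCOOP uses). In this Python sketch the merged global CDFG is written out directly, FIFO transfers are nodes on a hypothetical "fifo" resource that carries one word per cycle, and the start cycles of the transfer nodes give the communication order. All node and function names are illustrative.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, int]   # (resource class, latency in cycles)

def list_schedule(nodes: Dict[str, Node], edges: List[Tuple[str, str]],
                  capacity: Dict[str, int]) -> Dict[str, int]:
    """Resource-constrained list scheduling of a DAG; returns each node's start cycle.

    Priority is the latency-weighted longest path to a sink, so work on the critical
    path (including FIFO transfers that feed it) is issued first. Units are treated as
    non-pipelined. Assumes an acyclic graph, latencies >= 1, and capacities >= 1.
    """
    succs: Dict[str, List[str]] = {n: [] for n in nodes}
    preds: Dict[str, List[str]] = {n: [] for n in nodes}
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    prio: Dict[str, int] = {}
    def longest(n: str) -> int:
        if n not in prio:
            prio[n] = nodes[n][1] + max((longest(s) for s in succs[n]), default=0)
        return prio[n]

    start: Dict[str, int] = {}
    busy: List[Tuple[str, int]] = []     # (resource class, cycle at which it is free again)
    cycle = 0
    while len(start) < len(nodes):
        busy = [(r, t) for r, t in busy if t > cycle]
        ready = [n for n in nodes if n not in start and
                 all(p in start and start[p] + nodes[p][1] <= cycle for p in preds[n])]
        for n in sorted(ready, key=lambda m: (-longest(m), m)):
            res = nodes[n][0]
            if sum(1 for r, _ in busy if r == res) < capacity[res]:
                start[n] = cycle
                busy.append((res, cycle + nodes[n][1]))
        cycle += 1
    return start

# Hypothetical merged CDFG: the producer sends a, b, c over one FIFO; the consumer
# computes x = a + b, y = x * k on a 3-cycle multiplier, and z = c + 1.
NODES = {"send_a": ("fifo", 1), "send_b": ("fifo", 1), "send_c": ("fifo", 1),
         "add_x": ("alu", 1), "mul_y": ("mul", 3), "add_z": ("alu", 1)}
EDGES = [("send_a", "add_x"), ("send_b", "add_x"), ("add_x", "mul_y"), ("send_c", "add_z")]
start = list_schedule(NODES, EDGES, {"fifo": 1, "alu": 1, "mul": 1})
order = sorted((n for n in start if NODES[n][0] == "fifo"), key=start.__getitem__)
print(order, max(start[n] + NODES[n][1] for n in start))
```

Here the scheduler sends a and b before c because they feed the longer add-multiply chain; sending c first would push the multiply back by one cycle, giving a total latency of 7 instead of 6.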

17
Loop Indices Compression
  • Given the optimal order, we try to generate
    restructured loops for code compression
  • That is, given the original iteration order and
    the reordered one, find the minimum number of
    linear intervals that represent the new iteration
    space

Original order: (0,0), (0,1), (1,0), (1,1)
After reordering: (0,0), (1,0), (0,1), (1,1)
Need to solve the linear system; solution: $i' = j$, $j' = i$
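The slide finds the minimum number of linear intervals by solving a linear system. As a simpler stand-in (not necessarily the method used in SCOOP), the Python sketch below greedily splits a reordered iteration sequence into maximal constant-stride runs, each of which can be emitted as one loop. Greedy splitting is optimal for this particular criterion because any contiguous piece of a constant-stride run is itself a constant-stride run. The function name and the run format are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, ...]
Run = Tuple[Point, Point, int]   # (first iteration, stride vector, trip count)

def _delta(p: Point, q: Point) -> Point:
    return tuple(b - a for a, b in zip(p, q))

def linear_intervals(seq: List[Point]) -> List[Run]:
    """Split an iteration sequence into the fewest contiguous constant-stride runs."""
    runs: List[Run] = []
    k = 0
    while k < len(seq):
        if k + 1 == len(seq):                        # lone trailing point: a one-trip run
            runs.append((seq[k], tuple(0 for _ in seq[k]), 1))
            break
        stride = _delta(seq[k], seq[k + 1])
        n = 2
        while k + n < len(seq) and _delta(seq[k + n - 1], seq[k + n]) == stride:
            n += 1                                   # extend the run while the stride holds
        runs.append((seq[k], stride, n))
        k += n
    return runs

reordered = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(linear_intervals(reordered))
```

For the slide's example the reordered sequence yields two runs of stride (1,0), starting at (0,0) and (0,1): an inner loop over i nested in an outer loop over j, which is the interchange i' = j, j' = i.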
18
Preliminary Experimental Results
  • Experimental setting
      • Target communication model: two-process
        producer-consumer model
      • Behavioral synthesizer: UCLA xPilot
      • RTL simulator: Mentor ModelSim

               Total latency (cycles)            RAs compress
Designs        Trad.   SCOOP   Reduction (%)     Before   After
DCT1           325     290     10.77             0        0
Haar           142     134     5.63              0        0
DWT            689     617     10.45             0        0
Mat_mul        408     339     16.91             96       20
DCT2           483     419     13.25             80       64
Masking        620     420     32.26             192      0
Dot            1903    1084    43.04             300      0
An average of 26% improvement in total latency
can be achieved.
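The Reduction column is (Trad. - SCOOP) / Trad. x 100. A short Python check recomputes it from the two latency columns, along with two possible summary figures; the variable names are illustrative.

```python
# Rows copied from the results table: (design, traditional latency, SCOOP latency), in cycles.
ROWS = [
    ("DCT1", 325, 290), ("Haar", 142, 134), ("DWT", 689, 617),
    ("Mat_mul", 408, 339), ("DCT2", 483, 419), ("Masking", 620, 420),
    ("Dot", 1903, 1084),
]

def reduction_pct(trad: int, scoop: int) -> float:
    """Latency reduction of SCOOP relative to the traditional order, in percent."""
    return 100.0 * (trad - scoop) / trad

per_design = {name: round(reduction_pct(t, s), 2) for name, t, s in ROWS}
mean_of_rows = sum(reduction_pct(t, s) for _, t, s in ROWS) / len(ROWS)
total_trad = sum(t for _, t, _ in ROWS)
total_scoop = sum(s for _, _, s in ROWS)
aggregate = reduction_pct(total_trad, total_scoop)

print(per_design)
print(f"mean of per-design reductions: {mean_of_rows:.1f}%")
print(f"reduction of summed latency:   {aggregate:.1f}%")
```

Every row reproduces the table's Reduction column. For the seven designs listed, the unweighted mean of that column is about 18.9%, and the reduction in summed latency (4570 to 3303 cycles) is about 27.7%.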
19
Advantages of Register-File Microarchitectures
  • (a) A scheduled DFG with register binding
    indicated on each variable
  • (b) Binding using discrete registers
  • (c) Binding using a register file

20
Distributed Register-File Microarchitecture
[Figure: an FP-SoC organized into islands A, B, and C]

Xilinx Virtex II    XC2V2000  XC2V3000  XC2V4000  XC2V6000  XC2V8000
18Kb BRAM           56        96        120       144       168
Dist. RAM (Kb)      336       448       720       1,056     1,456

Altera Stratix      EP1S25    EP1S30    EP1S40    EP1S60    EP1S80
M512 (512b)         224       295       384       574       767
M4K (4Kb)           138       171       183       292       364
M-RAM (512Kb)       2         4         4         6         9

On-chip RAM resources (Virtex II and Stratix)
21
Resource Binding for DRFM
  • Facts under simplified assumptions
      • Operations bound onto an island form a chain in
        the given scheduled DFG
      • Inter-chain data transfers may share a physical
        inter-island connection
      • The number of inter-island connections is crucial
        to the QoR of a DRFM instance
  • Inter-island connections
      • (A,B) = (A,D) = 1
      • (A,C) = 1: two data transfers share one connection
      • (C,D) = 2
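The connection counts above can be reproduced with a small counting sketch under an assumed sharing rule: transfers between the same two islands in different control steps share one physical connection, so a pair of islands needs as many connections as its busiest control step. Treating connections as undirected is also an assumption, and the transfer list and `inter_island_connections` are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

Transfer = Tuple[str, str, str, int]   # (value, source island, destination island, control step)

def inter_island_connections(transfers: List[Transfer]) -> Dict[FrozenSet[str], int]:
    """Physical connections needed per island pair under the assumed sharing rule."""
    per_step: Counter = Counter()
    for _value, src, dst, step in transfers:
        if src != dst:                               # intra-island values need no connection
            per_step[(frozenset((src, dst)), step)] += 1
    need: Dict[FrozenSet[str], int] = {}
    for (pair, _step), count in per_step.items():
        need[pair] = max(need.get(pair, 0), count)   # busiest step decides the wire count
    return need

# Hypothetical transfers chosen to reproduce the counts on the slide.
EXAMPLE = [("v1", "A", "B", 1), ("v2", "A", "D", 2),
           ("v3", "A", "C", 1), ("v4", "A", "C", 3),   # different steps: share one A-C link
           ("v5", "C", "D", 2), ("v6", "C", "D", 2)]   # same step: two C-D links
conns = inter_island_connections(EXAMPLE)
print({"".join(sorted(p)): n for p, n in conns.items()}, sum(conns.values()))
```

The total, five connections here, is the quantity the relaxed binding problem sets out to minimize.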

22
Resource Binding Problem for DRFM
  • General DRFM binding problem
      • Given a scheduled DFG G and a DRFM M, find a
        feasible resource binding B(G,M) such that the
        quality of B is optimized
      • It is hard to characterize the quality of a
        binding solution B; the problem is too ad hoc
  • Relaxed problem: DRFM binding for minimizing
    inter-island connections
      • Given a scheduled DFG G and a DRFM M, find a
        feasible resource binding B(G,M) such that the
        total number of inter-island connections of B is
        minimized
      • Solution: control-step-by-step binding with
        min-cost bipartite matching
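A minimal Python sketch of control-step-by-step binding, assuming one functional unit per island, operands produced in earlier control steps, and a cost equal to the number of new directed inter-island links an assignment would add. For brevity each step's min-cost bipartite matching is solved by trying every assignment with `itertools.permutations`, which only scales to tiny examples; a practical tool would use a polynomial algorithm such as the Hungarian method (for example `scipy.optimize.linear_sum_assignment`). The function and operation names are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Link = Tuple[str, str]   # directed (producer island, consumer island) connection

def bind_step_by_step(steps: List[List[str]], operands: Dict[str, List[str]],
                      islands: List[str]) -> Tuple[Dict[str, str], Set[Link]]:
    """Bind each control step's operations to islands, minimizing new inter-island links."""
    binding: Dict[str, str] = {}
    links: Set[Link] = set()

    def new_links(op: str, island: str) -> Set[Link]:
        # Links that placing `op` on `island` would add; unbound primary inputs are ignored.
        return {(binding[p], island) for p in operands.get(op, [])
                if p in binding and binding[p] != island and (binding[p], island) not in links}

    for ops in steps:
        if len(ops) > len(islands):
            raise ValueError("more operations in a step than islands")
        best_cost, best_perm = None, None
        for perm in permutations(islands, len(ops)):   # exhaustive min-cost matching
            cost = sum(len(new_links(op, isl)) for op, isl in zip(ops, perm))
            if best_cost is None or cost < best_cost:
                best_cost, best_perm = cost, perm
        for op, isl in zip(ops, best_perm):
            links |= new_links(op, isl)
            binding[op] = isl
    return binding, links

STEPS = [["x", "y"], ["p", "q"], ["r"]]
OPERANDS = {"p": ["y"], "q": ["x"], "r": ["p", "q"]}
binding, links = bind_step_by_step(STEPS, OPERANDS, ["A", "B"])
print(binding, links)
```

In the second step the matching places p next to its producer y and q next to x, so no links are needed; only r, which reads values from both islands, forces a single B-to-A connection.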

23
Three Experimental Flows for Comparison
  • Common flow (xPilot behavioral synthesis system):
    xPilot front end → SSDM/CDFG → scheduling
    algorithms → scheduled CDFG (STG)
  • Binding, one of:
      1) Binding on discrete-register microarchitecture
      2) Baseline (random) DRFM binding
      3) DRFM binding for minimizing inter-island
         connections
  • RTL generation, targeting Xilinx Virtex II
24
Experimental Results
  • Xilinx ISE 7.1, Virtex II; target clock period
    8 ns
  • The baseline DRFM binding achieves a 46.70%
    slice reduction over the discrete-register
    approach
  • Optimized DRFM binding reduces slices by a further
    12.21%
  • Overall, more than 2X logic slice reduction with a
    better clock period (7.8)

[Charts: area (slices; DRF solutions use on-chip RAM blocks) and clock period (ns)]
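The slide's percentages combine into the "more than 2X" claim. Because it does not say what the further 12.21% is measured against, this short Python check evaluates both readings; both interpretations are assumptions.

```python
# Slice reductions reported on the slide, as fractions.
baseline_cut = 0.4670   # baseline DRFM binding vs. discrete-register binding
further_cut = 0.1221    # optimized DRFM binding, "further" reduction

# Reading 1 (assumed): the further 12.21% is relative to the baseline DRFM result.
remaining_compound = (1 - baseline_cut) * (1 - further_cut)
# Reading 2 (assumed): both percentages are relative to the discrete-register result.
remaining_additive = 1 - (baseline_cut + further_cut)

for label, remaining in [("compound", remaining_compound), ("additive", remaining_additive)]:
    print(f"{label}: {remaining:.1%} of the discrete-register slices, {1 / remaining:.2f}x fewer")
```

The compound reading leaves about 46.8% of the slices (about 2.14x fewer) and the additive reading about 41.1% (about 2.43x fewer), so either way the reduction exceeds 2X.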
25
Conclusions
  • xPilot can automatically synthesize behavior-level
    C or SystemC descriptions into RTL code with the
    necessary design constraints
  • Platform-based synthesis with physical planning
    provides
      • Shorter verification/simulation cycle
      • Better complexity management, faster time to
        market
      • Rapid system exploration
      • Higher quality of results
  • xPilot can help explore the efficient use of
    (multiple) on-chip processors
  • xPilot can efficiently optimize the software for
    reconfigurable processors
  • We are interested in engaging with selected
    industrial partners to further validate and
    enhance the technology

26
Acknowledgements
  • We gratefully acknowledge support from
      • National Science Foundation (NSF)
      • Gigascale Systems Research Center (GSRC)
      • Semiconductor Research Corporation (SRC)
      • Industrial sponsors under the California MICRO
        programs (Altera, Xilinx)
  • Team members

Yiping Fan
Zhiru Zhang
Wei Jiang
Guoling Han