Title: A Comprehensive Codesign Framework for Embedded Systems Rabi Mahapatra Department of Computer Scienc
1A Comprehensive Codesign Framework for Embedded Systems
Rabi Mahapatra
Department of Computer Science & Engg., Texas A&M University
April 2001
2Topics to be covered
- Introduction
- Existing Codesign Framework on ES
- What is lacking in this framework
- Two hot-spots investigation
- Available results
- Other research topics by the Codesign team at Texas A&M
- Conclusion
4Introduction
- Hardware-Software Codesign for embedded systems: research program by DARPA and NSF
- Codesign Definition:
- Meeting system-level objectives by exploiting the synergism of hardware and software through their concurrent design
5Concurrent design
- Traditional flow: HW and SW designed by independent groups of experts
- Concurrent (codesign) flow: HW and SW designed by the same group of experts with cooperation
6Why codesign?
- Reduce time to market
- Achieve better design
- Explore alternative designs
- Good design can be found by balancing the HW/SW
- To meet strict design constraints:
- power, size, timing, and performance trade-off
- safety and reliability
- system on chip
7Current Design Framework
8Framework Embedded System Codesign
9Contemporary Co-design framework
System Specification
Front-end Compiler
Validation
Behavior Description of Modules
Partitioning
Performance Estimation
Synthesis (S/W, Commn, H/W)
Integration
Co-Simulation
Constraint Verification
Implementation (CPU, ASIC, Memory)
10Two hot-spots in Codesign
- Partitioning
- Co-simulation
11Current Partitioning
- Binary partitioning prevails
- Manual even in most recent VCC tool
- Power is not considered as a parameter while partitioning
- Problem is NP-complete; imagine extended partitioning, multiple-space partitioning!
12Partitioning
- Related Work
- Tiwari, Malik & Wolfe, Power Analysis of Embedded Software: A First Step Towards Software Power Minimization, IEEE Trans. VLSI, 1994.
- Givargis, Henkel & Vahid, Interface and Cache Power Exploration for Core-Based Embedded System, ICCAD 1999.
- Stitt, Vahid et al., A First Step Towards an Architecture Tuning Methodology for Low Power, CASES 2000.
- Lu, Benini & De Micheli, Low-Power Task Scheduling for Multiple Devices, CODES 2000.
- Our approach: consider Area, Time and Power for design optimization
13Partitioning with Power as a parameter
System Specification in C
Software path: Simulation Analysis of Software Implementation, Power Analysis Sp (Power Profiler), Calculate Sa, St
Hardware path: Convert C code to VHDL (ART Builder), Hardware Synthesis (Design Compiler), Power Analysis Hp (Power Compiler), Calculate Ha, Ht
Optimization Engine
System Partitioned into H/W and S/W
14Partitioning - Software Analysis
- Software power for a specific microprocessor using Power Profiler.
- Software time using assembly-level code and instruction time profile
- Software area using memory size used up in bytes
- SPARClite processor has been used as the target for this experiment.
15Partitioning - Power Profiler
Input: file in assembly
Total Current = 0
Read the file line by line
Is EOF? YES: output Total Current x 3.3. NO: continue with the current line
Identify instruction INST from the current line
Compare INST with the hash table entries
Does INST match with a hash table entry? NO: skip the current line. YES: get the corresponding value of Current
Total Current = Total Current + Current
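The profiler flowchart can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the experiment: the per-instruction currents in `BASE_CURRENT_MA` are placeholder values, the function name `total_power_mw` is hypothetical, and the factor 3.3 is assumed to be the supply voltage in volts.

```python
from typing import Dict, Iterable

SUPPLY_VOLTAGE_V = 3.3  # assumed meaning of the 3.3 factor in the flowchart

# Hypothetical per-instruction currents in mA (placeholders, not measured data).
BASE_CURRENT_MA: Dict[str, float] = {
    "add": 200.0,
    "ld": 215.0,
    "st": 210.0,
    "nop": 190.0,
}


def total_power_mw(asm_lines: Iterable[str],
                   table: Dict[str, float] = BASE_CURRENT_MA,
                   voltage: float = SUPPLY_VOLTAGE_V) -> float:
    """Sum the table current of every recognised instruction, then scale by voltage."""
    total_current = 0.0
    for line in asm_lines:                      # read the file line by line
        code = line.split("!", 1)[0]            # drop SPARC-style '!' comments
        tokens = [t for t in code.split() if not t.endswith(":")]  # drop labels
        if not tokens:
            continue                            # blank or comment-only line
        inst = tokens[0].lower()                # identify instruction INST
        if inst not in table:
            continue                            # no hash table match: skip line
        total_current += table[inst]            # accumulate the matching current
    return total_current * voltage              # output at end of file


example = ["  add %g1, %g2, %g3", "loop: ld [%o0], %o1", "! comment", ".align 4", "nop"]
print(total_power_mw(example))
```

A dictionary plays the role of the hash table, so each instruction lookup takes constant time regardless of the size of the instruction set.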
16Partitioning - Hardware synthesis
- Hardware power using Synopsys Power Compiler
- Hardware area and time using Synopsys Design Analyzer
- Technology in use
17Partitioning - EGLT optimization
- Enumeration-Gray code with Lookup Table
- Complete coverage; most efficient for medium-sized binary partitioning.
- Case Studies:
- QAM Modem: has 21 modules for partitioning into hardware and software modules
- Optimization engine optimizes the allocation, taking Area, Power and Timing requirements into account.
18Partitioning - Case Study
A Ptolemy Snapshot of Modem design
19Software Analysis Modem
Block Name    Area (bytes)  Time (ns)  Power (mW)
AddCx         6956          640        873.5
AddInt        6804          782        885.2
BitsToInt     7024          1440       840.0
C2R           6864          626        869.5
Dist          6868          513        863.3
FtoI          6956          74340      832.1
GainInt       6892          9476       886.9
Gaussian      7304          76469      811.5
IID Uniform   7108          64282      832.1
LMSCx         8508          13697      806.1
ModuloInt     6900          8715       879.7
Quant         7140          4173       816.6
R2C           6864          367        878.3
SubCx         6972          547        873.5
TableCx       7736          902        845.8
TableInt      7172          853        848.6
20Hardware Synthesis Modem
Block Name    Area (# of gates)  Time (ns)  Power (mW unless noted)
AddCx         594                9.96       40.56
AddInt        299                9.96       19.6825
BitsToInt     114                1.07       9.1423
C2R           0                  0          3.1689
Dist          0                  0          3.1689
FtoI          42                 0.43       1.13
GainInt       0                  0          1.5350
Gaussian      13555              2.634      2 W
IID           13555              2.634      2 W
LMSCx         80215              30.68      8.0758 W
ModuloInt     0                  0          99.0293 uW
Quant         59                 1.18       4.4644
R2C           0                  0          3.1689
SubCx         687                9.96       49.2913
TableCx       332                3.18       21.6056
TableInt      178                2.54       11.0962
21Functions
22EGLT optimization
Decimal  Binary  Gray
0        0000    0000
1        0001    0001
2        0010    0011
3        0011    0010
4        0100    0110
5        0101    0111
6        0110    0101
7        0111    0100
8        1000    1100
etc.
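Consecutive Gray codes differ in exactly one bit, so an exhaustive binary partitioning search can move one module per step and update the running totals from a per-module lookup table instead of recomputing them. The sketch below illustrates that idea only; it is not the EGLT engine itself. The function name `best_partition`, the additive cost model (times and powers simply add up across modules) and the constraint values are assumptions. The sample costs are the AddCx, FtoI and BitsToInt rows of the modem tables.

```python
from typing import Optional, Sequence, Tuple

Cost = Tuple[float, float, float]  # (area, time in ns, power in mW)


def gray(i: int) -> int:
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def best_partition(sw: Sequence[Cost], hw: Sequence[Cost],
                   max_hw_area: float, max_time: float) -> Optional[Tuple[float, int]]:
    """Return (power, code) of the lowest-power feasible partition, or None.

    Bit k of code is 1 when module k is placed in hardware.
    """
    n = len(sw)
    code = 0                              # start with every module in software
    hw_area = 0.0
    time = sum(c[1] for c in sw)
    power = sum(c[2] for c in sw)
    best = None
    for i in range(1 << n):
        if i > 0:
            bit = (i & -i).bit_length() - 1   # the one bit that changes at step i
            code ^= 1 << bit                  # code now equals gray(i)
            sign = 1 if (code >> bit) & 1 else -1   # +1: moved to HW, -1: back to SW
            hw_area += sign * hw[bit][0]
            time += sign * (hw[bit][1] - sw[bit][1])
            power += sign * (hw[bit][2] - sw[bit][2])
        if hw_area <= max_hw_area and time <= max_time:
            if best is None or power < best[0]:
                best = (power, code)
    return best


sw_costs = [(6956, 640, 873.5), (6956, 74340, 832.1), (7024, 1440, 840.0)]
hw_costs = [(594, 9.96, 40.56), (42, 0.43, 1.13), (114, 1.07, 9.1423)]
print(best_partition(sw_costs, hw_costs, max_hw_area=700, max_time=1e9))
```

Each step costs a constant number of additions, so the full sweep over n modules takes on the order of 2^n updates, which is practical only for medium-sized problems.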
23Results
- Best partition:
- HW Modules: AddInt, AddCx, TableInt, TableCx, GainInt, IID Uniform, Modulo, Dist, Bits to Int, Quant, CtoR, RtoC, SubCx
- SW Modules: IID Gaussian, LMSCx, Float to Int
- Lowest Power obtained for the constraints set: 4619 mW
- Hardware Area: 16238
- Software Area: 23508
- Timing: 166682 ns
24EGLT Partitioning Results
25Optimization operations
27Partitioning overheads
- Growing design space due to complex architecture, technology, and solutions
- Number of implementations for mapping a system specification made of n tasks on an architecture made of q nonempty modules: S(n, q)
- With p different kinds of technology to implement each module, the number of possible implementations grows to NbArchitecture(n, p)
- Example: n = 4, p = 2, q = 3, Nb = 309
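To get a feel for how quickly the design space grows, the count of ways to split n tasks into q nonempty modules can be computed directly. The sketch assumes S(n, q) denotes the Stirling number of the second kind, which counts exactly such groupings; the function name `stirling2` is illustrative. The NbArchitecture(n, p) formula is not given on the slide, so it is not reproduced here.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, q: int) -> int:
    """Number of ways to split n distinct tasks into q nonempty, unlabeled modules."""
    if n == q:
        return 1                  # one task per module (also covers n = q = 0)
    if q == 0 or q > n:
        return 0                  # no valid grouping
    # Task n either joins one of the q existing modules or opens a new one.
    return q * stirling2(n - 1, q) + stirling2(n - 1, q - 1)


for tasks in (4, 8, 16, 21):
    print(tasks, [stirling2(tasks, q) for q in (2, 3, 4)])
```

Even before technology choices are considered, 21 tasks already admit more than a million two-module groupings, which is why pruning the space before partitioning matters.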
28DSE Related work
- PMOSS (1996), LYCOS (1997), COSYMA (1998): mono-processor
- POLIS (1996): one µP and several co-processors
- SpecSyn (1998): multiprocessor, manual allocation before partitioning.
Users have no basis for fixing the number of components before partitioning.
29Objective
- Efficient Design Space Exploration (DSE) to reduce partitioning overhead.
- Determine Partition Boundaries at System-Level
- Insert pre-allocation stage before Allocation to reduce Design Space.
- Evaluation of associated cost
30Approach
- Specify-Explore-Refine paradigm at the system level
- Specification: specify desired functionality with no implementation detail
- Exploration: explore design alternatives satisfying the design constraints. Partition the functional specification among components. Estimate alternative design approaches.
- Refinement: refine the initial specification to reflect decisions made in the Exploration stage. Verify initial specifications
31Design Space Exploration
System Behavior
Pre-Allocation Allocation
Performance Estimation
Partitioning
32Exploration
- Allocation: adding components to the design.
- Each component is characterized by its constraints and technology file.
- Partitioning: assigning functional modules to components (behaviors to standard processors, channels to buses, etc.)
- Estimation: use of SpecSyn
- Cost Function: k1.F(c1.size, c1.size_constr) + k2.F(c2.size, c2.size_constr)
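A cost function of this shape can be sketched as a weighted sum of per-component constraint violations. The slide does not define F, so the sketch assumes F is the normalized amount by which a size exceeds its constraint (zero when the constraint is met); `Component`, `violation` and `cost` are illustrative names, not SpecSyn APIs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Component:
    name: str
    size: float          # estimated size of what is mapped to the component
    size_constr: float   # size constraint for the component


def violation(value: float, constraint: float) -> float:
    """Assumed form of F: relative overshoot of the constraint, 0 if satisfied."""
    return max(0.0, (value - constraint) / constraint)


def cost(components: Sequence[Component], weights: Sequence[float]) -> float:
    """Weighted sum k1*F(c1) + k2*F(c2) + ... over all components."""
    return sum(k * violation(c.size, c.size_constr)
               for k, c in zip(weights, components))


asic = Component("asic", size=12000, size_constr=10000)   # 20% over its constraint
cpu = Component("cpu", size=8000, size_constr=16000)      # within its constraint
print(cost([asic, cpu], weights=[1.0, 2.0]))
```

With this choice of F, a partition that meets every constraint scores zero, and the weights k1, k2 express which violations the designer is least willing to tolerate.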
33DSE(contd)
- Allocation mostly fixed in traditional methodology
- Architecture: Processor (Controller), ASIC, Memories
- Includes only HW/SW partitioning (binary)
- Performance Estimation used only for Partitioning
34Problem Statement
- New Methodology
- Embedded systems now have heterogeneous multiprocessors with ASICs, ASIPs, DSPs, processors, memories, etc.
- Design Space Exploration is thus an NP-complete problem
35Pre-Allocation
- Main Goal: reduce the Design Space to reduce the Design Time
- Use of Performance Estimation for Allocation
- Decision can be based on heuristics at the system level
- Exact performance is determined after Co-simulation
36Hardware Platform
- A platform with arrays of processors and ASICs
- Number of ASICs and processors to be used is based on performance estimation
- It gives the designer a starting point for Design Space Exploration
- Job of the designer: process allocation and interconnection among the various components
37Proposed Methodology
- Build process graph from specifications
- Identify leaf and root nodes
- Map leaves to Processors and roots to ASIC
- Find performance parameter for each node when implemented on ASIC, Processor, DSP
- Find the critical path from the constraints
- Processor Merging
38Proposed Methodology(Contd)
- No two concurrent leaves on the same processor
- If a constraint is not satisfied, the leaf is implemented on an ASIC (multiple copies)
- Optimistic number of processors = number of concurrent leaves
- Processors form a library of functions
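The first steps of the methodology (find roots and leaves in the process graph, then give each concurrent leaf its own processor) can be sketched as follows. The graph encoding, the function names and the processor labels `P0`, `P1`, ... are assumptions for illustration. The sketch also treats every leaf as concurrent with every other leaf, since the slides do not say how concurrency is determined, and it leaves intermediate nodes unmapped.

```python
from typing import Dict, List, Tuple


def roots_and_leaves(graph: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """graph maps each process to the processes it calls (its children)."""
    children = {c for kids in graph.values() for c in kids}
    nodes = set(graph) | children
    roots = sorted(nodes - children)                      # never called by anyone
    leaves = sorted(n for n in nodes if not graph.get(n)) # call nothing themselves
    return roots, leaves


def initial_mapping(graph: Dict[str, List[str]]) -> Tuple[Dict[str, str], int]:
    """Map roots to the ASIC and give each leaf its own processor."""
    roots, leaves = roots_and_leaves(graph)
    mapping = {r: "ASIC" for r in roots}
    for i, leaf in enumerate(leaves):
        mapping[leaf] = f"P{i}"        # no two concurrent leaves share a processor
    return mapping, len(leaves)        # optimistic processor count


process_graph = {
    "modem": ["filter", "quant"],
    "filter": ["add", "mul"],
    "quant": ["add"],
}
print(initial_mapping(process_graph))
```

Shared leaves such as `add` in this example are what make the platform economical: one processor serves several behavioral modules.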
39Design Space Exploration
40Pros and Cons...
- Advantages: can implement many behavioral modules on the same chip
- Disadvantages: different behavioral modules should have common leaf nodes, otherwise it is very expensive
41Experiment and Results
42Cosimulation
- Definition
- Process of simulating the HW and SW
components of a mixed HW/SW system within a
unified environment.
43Cosimulation Related work
- Coumeri & Thomas (ICCD 95)
- Tabbara et al. (DATE 99)
- Durbhakula, Pai, Adve (HPCA 99)
- Pirvu, Bhuyan, Mahapatra (ICCD 2000)
44Co-simulation - Need
- Architectural simulators overlook hardware complexity and lack accuracy
- Integration of HDL models with an architecture-level simulator is slow
- Best solution is to implement the Subsystem under Test in FPGA and integrate this with the architecture-level simulator
45Co-simulation - How it fits
Execution driven simulation
HDL Description
Synthesize
Resimulate
HW-SW Cosimulation
Execution driven simulation
46Cosimulation - Case Study
FPGA
47Cosimulation - Case study(contd)
- A multiprocessor system with different switch models, arbitration rules and buffering techniques of the interconnect
- RSIM: architecture-level multiprocessor simulation environment
- FPGA implementation of switching network
- Serial port interface between the two
48Cosimulation - Case study(contd)
- Reducing pin-to-pin latency
- Optimization of Spider-like switch to minimize pin-to-pin/fall-through latency
- Flit size - 64 bits
- Phit size - 16 bits
- Assembling phits - 17.5ns
- Arbitration - 10ns
- Crossbar transfer - 10ns
- Serialization - 2.5ns
- Total time - 40ns
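The latency budget above can be checked with a few lines of arithmetic. The variable names are illustrative; the figures are the ones listed on the slide.

```python
FLIT_BITS = 64
PHIT_BITS = 16

# Per-stage latencies of the Spider-like switch, in nanoseconds.
stage_latency_ns = {
    "assembling phits": 17.5,
    "arbitration": 10.0,
    "crossbar transfer": 10.0,
    "serialization": 2.5,
}

phits_per_flit = FLIT_BITS // PHIT_BITS          # phits needed to carry one flit
fall_through_ns = sum(stage_latency_ns.values()) # stages run back to back

print(phits_per_flit, fall_through_ns)
```

Assembling the phits into a flit is the largest single stage, which is why starting work as soon as the first phit arrives is an attractive optimization.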
49Cosimulation - Case study(contd)
- Pipeline this structure
- Start processing immediately after 1st phit arrives
- Reduce data path size by 4 and increase core frequency by 4
- Performance evaluation using Cosimulation shows super pipelining can halve the fall-through latency
50Cosimulation - Case study(contd)
Results
Spider-like design: First phit arrives, Synchronize 17.5ns, Arbitration 10ns, Crossbar tx 10ns, Wire tx (2.5ns link). Fall-through latency 40ns
Superpipelined design: First phit arrives, Synch, Arbitration, Crossbar link tx (pipelined). Fall-through latency 17.5ns
51Cosimulation - Performance of Simulation Techniques
[Table: simulation performance for three cases (RSIM, RSIM Verilog, RSIM FPGA); columns cover processor/memory simulation, interconnect simulation, communication, and synchronization]
Values as listed: - 1.26 - 0.08 4.60 1.26 1.16 4.70 0.35 1.26 NA NA
52Modified Design Framework
53Modified Design Framework What it provides -
54Other related topics by the Codesign team at Texas A&M University
- Integration of Power simulation capabilities with SimOS
- Optimized VMX Architecture modeling for future generation processors
- Optimized HDLC Core
55Codesign Intelligent Agent
56Meeting Custom Design needs
57Summary
58Comments on this Framework
59What this framework lacks
60Co-simulation
61Co-simulation (contd)
62Co-simulation (contd)
63Modified Design Framework
64Modified Design Framework (contd)
66Future Research Plans
- Codesign Intelligent Agent based on ANN hardware
- DSP reconfigurable array processors to be used as HW-SW interface to meet custom designs
67Codesign Intelligent Agent
68Meeting Custom Design needs
69Conclusion