A Comprehensive Codesign Framework for Embedded Systems Rabi Mahapatra Department of Computer Scienc - PowerPoint PPT Presentation

Provided by: Mah129

Learn more at: http://research.cs.tamu.edu

Transcript and Presenter's Notes

1
A Comprehensive Codesign Framework for Embedded Systems
Rabi Mahapatra
Department of Computer Science & Engg.
Texas A&M University
April 2001
2
Topics to be covered

Introduction
Existing Codesign Framework on ES
What is lacking in this framework
Two hot-spots investigation
Available results
Other research topics by the Codesign team at Texas A&M
Conclusion

3
Introduction

Hardware-Software Codesign for embedded system
research program by DARPA and NSF

4
Introduction

Hardware-Software Codesign for embedded system
research program by DARPA and NSF
Codesign definition: meeting system-level objectives by exploiting the synergism of hardware and software through their concurrent design

5
Concurrent design

Traditional design flow: Start, then HW and SW designed by independent groups of experts
Concurrent (codesign) flow: Start, then HW and SW designed by the same group of experts with cooperation
6
Why codesign?

Reduce time to market
Achieve better design
Explore alternative designs
A good design can be found by balancing HW and SW
Meet strict design constraints:
  power, size, timing, and performance trade-offs
  safety and reliability
  system on chip

7
Current Design Framework
8
Framework Embedded System Codesign
9
Contemporary Co-design framework
System Specification
Front end Compiler
Validation
Behavior Description of Modules
Partitioning
Performance Estimation
Synthesis (S/W, Communication, H/W)
Integration
Co-Simulation
Constraint Verification
Implementation (CPU, ASIC, Memory)
10
Two hot-spots in Codesign

Partitioning
Co-simulation

11
Current Partitioning

Binary partitioning prevails
Manual even in most recent VCC tool
Power is not considered as a parameter while partitioning
The problem is NP-complete; imagine extended partitioning or multiple-space partitioning!

12
Partitioning

Related Work
Tiwari, Malik & Wolfe, "Power Analysis of Embedded Software: A First Step Towards Software Power Minimization", IEEE Trans. VLSI, 1994.
Givargis, Henkel & Vahid, "Interface and Cache Power Exploration for Core-Based Embedded System", ICCAD 1999.
Stitt, Vahid et al., "A First Step Towards an Architecture Tuning Methodology for Low Power", CASES 2000.
Lu, Benini & De Micheli, "Low-Power Task Scheduling for Multiple Devices", CODES 2000.
Our approach: consider Area, Time and Power for design optimization

13
Partitioning with Power as a parameter
System Specification in C
Simulation Analysis of Software Implementation
Convert C code to VHDL (ART Builder)
Power Analysis: Sp (Power Profiler)
Calculate Sa, St
Hardware Synthesis (Design Compiler)
Power Analysis: Hp (Power Compiler)
Calculate Ha, Ht
Optimization Engine
System partitioned into H/W and S/W
(Sp, Sa, St: software power, area, time; Hp, Ha, Ht: hardware power, area, time)
14
Partitioning - Software Analysis

Software power for a specific microprocessor, measured with the Power Profiler
Software time from assembly-level code and the instruction time profile
Software area as the memory size used, in bytes
The SPARClite processor is the target for this experiment

15
Partitioning - Power Profiler
Flowchart steps:
1. Input: file in assembly
2. Set Total Current = 0
3. Read the file line by line
4. Is EOF? YES: output Total Current x 3.3 and stop. NO: continue
5. Identify instruction INST from the current line
6. Compare INST with the hash table entries
7. Does INST match a hash table entry? NO: skip the current line. YES: get the corresponding value of the current
8. Total Current = Total Current + Current, then go back to step 3
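A minimal Python sketch of the profiler flow on this slide, included only to make the loop concrete. The function name `profile_power`, the table contents, and the rule "the mnemonic is the first token on the line" are assumptions for illustration, not the presenters' implementation; the per-instruction currents are made-up numbers, not SPARClite measurements. The factor 3.3 is taken from the slide and read here as a supply voltage.

```python
from typing import Dict, Iterable

SUPPLY_VOLTAGE = 3.3  # factor shown on the slide, assumed to be volts

# Hypothetical per-instruction currents in mA (illustrative values only).
EXAMPLE_CURRENT_MA: Dict[str, float] = {"add": 100.0, "ld": 200.0}


def profile_power(lines: Iterable[str],
                  current_table: Dict[str, float],
                  voltage: float = SUPPLY_VOLTAGE) -> float:
    """Sum the table current of every recognized instruction, times voltage."""
    total_current = 0.0
    for line in lines:                      # read the file line by line
        fields = line.split()
        if not fields:                      # blank line: nothing to identify
            continue
        inst = fields[0].lower()            # assumed: mnemonic is the first token
        if inst in current_table:           # hash table lookup
            total_current += current_table[inst]
        # no match: skip the current line
    return total_current * voltage          # output at end of file


if __name__ == "__main__":
    listing = ["add %g1, %g2, %g3", "! a comment", "ld [%o0], %l1", "", "nop"]
    print(profile_power(listing, EXAMPLE_CURRENT_MA))
```

A dictionary plays the role of the hash table, so each lookup is constant time and unrecognized lines (comments, directives, labels) are skipped as in the flowchart.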
16
Partitioning - Hardware synthesis

Hardware power using Synopsys Power Compiler
Hardware Area and time using Synopsys Design
Analyzer
Technology in use

17
Partitioning - EGLT optimization

Enumeration-Gray code with Lookup Table (EGLT): complete coverage; most efficient for medium-sized binary partitioning.
Case study
QAM Modem: has 21 modules to partition into hardware and software modules
The optimization engine optimizes the allocation against Area, Power and Timing requirements.

18
Partitioning - Case Study
A Ptolemy Snapshot of Modem design
19
Software Analysis: Modem

Block Name    Area (bytes)  Time (ns)  Power (mW)
AddCx         6956          640        873.5
AddInt        6804          782        885.2
BitsToInt     7024          1440       840.0
C2R           6864          626        869.5
Dist          6868          513        863.3
FtoI          6956          74340      832.1
GainInt       6892          9476       886.9
Gaussian      7304          76469      811.5
IID Uniform   7108          64282      832.1
LMSCx         8508          13697      806.1
ModuloInt     6900          8715       879.7
Quant         7140          4173       816.6
R2C           6864          367        878.3
SubCx         6972          547        873.5
TableCx       7736          902        845.8
TableInt      7172          853        848.6

20
Hardware Synthesis: Modem

Block Name    Area (# of gates)  Time (ns)  Power (mW unless noted)
AddCx         594                9.96       40.56
AddInt        299                9.96       19.6825
BitsToInt     114                1.07       9.1423
C2R           0                  0          3.1689
Dist          0                  0          3.1689
FtoI          42                 0.43       1.13
GainInt       0                  0          1.5350
Gaussian      13555              2.634      2 W
IID           13555              2.634      2 W
LMSCx         80215              30.68      8.0758 W
ModuloInt     0                  0          99.0293 uW
Quant         59                 1.18       4.4644
R2C           0                  0          3.1689
SubCx         687                9.96       49.2913
TableCx       332                3.18       21.6056
TableInt      178                2.54       11.0962

21
Functions
22
EGLT optimization

Decimal Binary Gray
0 0000 0000
1 0001 0001
2 0010 0011
3 0011 0010
4 0100 0110
5 0101 0111
6 0110 0101
7 0111 0100
8 1000 1100 etc.
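The table can be reproduced with `g = i ^ (i >> 1)`. The sketch below shows why Gray-code order is convenient for exhaustive binary partitioning: consecutive codes differ in one bit, so moving from one candidate partition to the next flips a single module between software and hardware, and the running totals can be updated from per-module lookup tables instead of being recomputed. This is a hedged illustration, not the presenters' EGLT code. The names `gray` and `best_partition`, the (area, time, power) tuple layout, the additive timing model, and the choice to minimize power under hardware-area and time limits are all assumptions.

```python
from typing import List, Optional, Tuple

Cost = Tuple[int, int, int]  # (area, time, power) for one module, assumed layout


def gray(i: int) -> int:
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def best_partition(sw: List[Cost], hw: List[Cost],
                   max_hw_area: int, max_time: int) -> Optional[Tuple[int, int]]:
    """Return (lowest power, bit mask) over all 2^n partitions; bit k set = module k in HW."""
    n = len(sw)
    code = 0                                  # start with every module in software
    hw_area = 0
    time = sum(c[1] for c in sw)
    power = sum(c[2] for c in sw)
    best: Optional[Tuple[int, int]] = None
    for i in range(1 << n):
        if i > 0:
            k = (i & -i).bit_length() - 1     # the one bit that changes at step i
            code ^= 1 << k
            sign = 1 if (code >> k) & 1 else -1   # +1: moved to HW, -1: back to SW
            hw_area += sign * hw[k][0]
            time += sign * (hw[k][1] - sw[k][1])
            power += sign * (hw[k][2] - sw[k][2])
        if hw_area <= max_hw_area and time <= max_time:
            if best is None or power < best[0]:
                best = (power, code)
    return best


if __name__ == "__main__":
    print([format(gray(i), "04b") for i in range(9)])
    sw = [(10, 100, 50), (10, 200, 60), (10, 50, 40)]   # made-up numbers
    hw = [(5, 10, 5), (20, 20, 10), (8, 5, 30)]         # made-up numbers
    print(best_partition(sw, hw, max_hw_area=15, max_time=400))
```

With n modules the loop still visits all 2^n partitions, which is why full enumeration only suits medium-sized problems; the Gray ordering just makes each step a constant-time update.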

23
Results

Best partition
HW modules: AddInt, AddCx, TableInt, TableCx, GainInt, IID Uniform, Modulo, Dist, Bits to Int, Quant, CtoR, RtoC, SubCx
SW modules: IID Gaussian, LMSCx, Float to Int
Lowest power obtained for the constraint set: 4619 mW
Hardware area: 16238
Software area: 23508
Timing: 166682 ns

24
EGLT Partitioning Results
25
Optimization operations
26
(No Transcript)
27
Partitioning overheads

Growing design space due to complex architectures, technologies, and solutions
Number of implementations for mapping a system specification made of n tasks onto an architecture made of q nonempty modules: S(n,q)
With p different kinds of technology to implement each module, the number of possible implementations grows to NbArchitecture(n, p)
Example: n = 4, p = 2, q = 3 gives Nb = 309

28
DSE Related work

PMOSS (1996), LYCOS (1997), COSYMA (1998): mono-processor
POLIS (1996): one µP and several co-processors
SpecSyn (1998): multiprocessor, manual allocation before partitioning

Users have no basis for fixing the number of components before partitioning.
29
Objective

Efficient Design Space Exploration (DSE) to
reduce partitioning overhead.
Determine Partition Boundaries at System-Level
Insert pre-allocation stage before Allocation to
reduce Design Space.
Evaluation of associated cost

30
Approach

Specify-Explore-Refine paradigm at the system level
Specification: specify the desired functionality with no implementation detail
Exploration: explore design alternatives that satisfy the design constraints. Partition the functional specification among components. Estimate alternative design approaches.
Refinement: refine the initial specification to reflect decisions made in the Exploration stage. Verify the initial specifications.

31
Design Space Exploration
System Behavior
Pre-Allocation Allocation
Performance Estimation
Partitioning
32
Exploration

Allocation: adding components to the design. Each component is characterized by its constraints and technology file.
Partitioning: assigning functional modules to components (behaviors to standard processors, channels to buses, etc.)
Estimation: use of SpecSyn
Cost function = k1.F(c1.size, c1.size_constr) + k2.F(c2.size, c2.size_constr)
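A small Python sketch of a weighted cost function of the shape shown on this slide. The slide does not define F; here F is assumed to be the normalized amount by which a component's size exceeds its constraint (zero when the constraint is met). The names `violation` and `partition_cost` and the dictionary keys are hypothetical.

```python
from typing import Dict, Sequence


def violation(size: float, size_constr: float) -> float:
    """Assumed F: relative excess of size over its constraint, 0 if satisfied."""
    if size_constr <= 0:
        raise ValueError("size constraint must be positive")
    return max(0.0, (size - size_constr) / size_constr)


def partition_cost(components: Sequence[Dict[str, float]],
                   weights: Sequence[float]) -> float:
    """k1*F(c1.size, c1.size_constr) + k2*F(c2.size, c2.size_constr) + ..."""
    if len(components) != len(weights):
        raise ValueError("one weight per component is required")
    return sum(k * violation(c["size"], c["size_constr"])
               for k, c in zip(weights, components))


if __name__ == "__main__":
    comps = [{"size": 120, "size_constr": 100},   # 20% over its limit
             {"size": 50, "size_constr": 80}]     # within its limit
    print(partition_cost(comps, weights=[1.0, 2.0]))
```

Under this assumed F, a cost of zero means every component meets its size constraint, and the weights k1, k2 set how strongly each component's violation is penalized.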

33
DSE(contd)

Allocation mostly fixed in traditional
methodology
Architecture Processor (Controller), ASIC,
Memories
Includes only HW/SW partitioning (binary)
Performance Estimation used only for Partitioning

34
Problem Statement

New Methodology
Embedded systems now have heterogeneous multiprocessors with ASICs, ASIPs, DSPs, processors, memories, etc.
Design Space Exploration is thus an NP-complete problem

35
Pre-Allocation

Main goal: reduce the design space to reduce the design time
Use of Performance Estimation for Allocation
Decision can be based on Heuristics at the system
level
Exact performance is determined after
Co-simulation

36
Hardware Platform

A platform with arrays of processors and ASICs
The number of ASICs and processors to be used is based on performance estimation
It gives the designer a starting point for Design Space Exploration
Job of the designer: process allocation and interconnection among the various components

37
Proposed Methodology

Build Process graph from specifications
Identify leaf and root nodes
Map Leaf to Processors and Root to ASIC
Find performance parameters for each node when implemented on an ASIC, processor, or DSP
Find the critical path from the constraints
Processor Merging

38
Proposed Methodology(Contd)

No two concurrent leaves on the same processor
If the constraint is not satisfied, the leaf is implemented on an ASIC (multiple copies)
Optimistic number of processors = number of concurrent leaves
Processors form a library of functions
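The first steps of the methodology (build the process graph, identify leaf and root nodes, map leaves to processors and roots to ASICs) can be sketched in Python as follows. The slides do not define the graph representation, so this assumes a dictionary from each process to the processes it invokes, takes a root to be a node with no parent and a leaf to be a node with no children, and treats all leaves as concurrent so that each gets its own processor. The function names and the example graph are hypothetical.

```python
from typing import Dict, List, Tuple


def classify_nodes(graph: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Return (roots, leaves) of a process graph given as parent -> children."""
    nodes = set(graph)
    for children in graph.values():
        nodes.update(children)
    has_parent = {c for children in graph.values() for c in children}
    roots = sorted(n for n in nodes if n not in has_parent)
    leaves = sorted(n for n in nodes if not graph.get(n))
    return roots, leaves


def initial_mapping(graph: Dict[str, List[str]]) -> Dict[str, str]:
    """Optimistic start: one processor per leaf, every root on an ASIC."""
    roots, leaves = classify_nodes(graph)
    mapping = {leaf: f"processor{i}" for i, leaf in enumerate(leaves)}
    for root in roots:
        mapping[root] = "asic"
    return mapping


if __name__ == "__main__":
    example = {"top": ["filter", "fft"],
               "filter": ["mac"],
               "fft": ["mac", "butterfly"]}
    print(classify_nodes(example))
    print(initial_mapping(example))
```

Later steps on the slides (performance estimation per node, critical path, processor merging) would then reduce this optimistic processor count; they are not modeled here.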

39
Design Space Exploration
40
Pros and Cons...

Advantages: can implement many behavioral modules on the same chip
Disadvantages: different behavioral modules should have common leaf nodes; otherwise it is very expensive

41
Experiment and Results
42
Cosimulation

Definition
Process of simulating the HW and SW
components of a mixed HW/SW system within a
unified environment.

43
Cosimulation Related work

Coumeri & Thomas (ICCD 95)
Tabbara et.al. (DATE 99)
Durbhakula, Pai, Adve (HPCA 99)
Pirvu, Bhuyan, Mahapatra (ICCD 2000)

44
Co-simulation - Need

Architectural simulators overlook hardware
complexity and lack accuracy
Integration of HDL models with Architecture level
simulator is pretty slow
Best solution is to implement the Subsystem under
Test in FPGA and integrate this with the
architecture level simulator

45
Co-simulation - How it fits
Execution driven simulation
HDL Description
Synthesize
Resimulate
HW-SW Cosimulation
Execution driven simulation
46
Cosimulation - Case Study
FPGA
47
Cosimulation - Case study(contd)

A Multiprocessor system with different switch
models, arbitration rules and buffering
techniques of the interconnect
RSIM - Architecture level Multiprocessor
simulation environment
FPGA implementation of switching network
Serial port interface between the two

48
Cosimulation - Case study(contd)

Reducing Pin to Pin latency
Optimization of Spider-like switch to minimize
Pin to Pin/Fall through latency
Flit size - 64 bits
Phit size - 16 bits
Assembling phits - 17.5ns
Arbitration - 10ns
Crossbar transfer - 10ns
Serialization - 2.5ns
Total time - 40ns

49
Cosimulation - Case study(contd)

Pipeline this structure
Start processing immediately after 1st phit
arrives
Reduce data path size by a factor of 4 and increase core frequency by a factor of 4
Performance evaluation using Cosimulation shows
super pipelining can halve the fall through
latency

50
Cosimulation - Case study(contd)
Results
Spider-like design: first phit arrives, then synchronize (17.5 ns), arbitration (10 ns), crossbar tx (10 ns), wire tx (2.5 ns link). Fall-through latency: 40 ns
Superpipelined design: first phit arrives, then synch, arbitration, crossbar link tx (pipelined). Fall-through latency: 17.5 ns
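The latency figures in this case study can be checked with a few lines of Python. The stage times, flit and phit sizes, and the 17.5 ns superpipelined result are the values quoted on the slides; the variable and function names are made up for this sketch.

```python
# Stage times of the Spider-like switch, in nanoseconds, as quoted on the slides.
SPIDER_STAGES_NS = {
    "assemble phits": 17.5,
    "arbitration": 10.0,
    "crossbar transfer": 10.0,
    "serialization": 2.5,
}

FLIT_BITS = 64
PHIT_BITS = 16
PHITS_PER_FLIT = FLIT_BITS // PHIT_BITS   # phits that make up one flit

SUPERPIPELINED_NS = 17.5                  # reported fall-through latency


def fall_through_latency(stages: dict) -> float:
    """Unpipelined fall-through latency: the stages run back to back."""
    return sum(stages.values())


if __name__ == "__main__":
    total = fall_through_latency(SPIDER_STAGES_NS)
    print("phits per flit:", PHITS_PER_FLIT)
    print("Spider-like latency (ns):", total)
    print("superpipelined / Spider-like:", SUPERPIPELINED_NS / total)
```

The stage times add up to the 40 ns total on the slide, and the reported superpipelined latency is a little under half of that, consistent with the claim that super pipelining can halve the fall-through latency.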
51
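Cosimulation - Performance of Simulation Techniques
[Table: simulation cost broken down into processor and memory simulation, interconnect simulation, communication, and synchronization, for three cases: RSIM, RSIM + Verilog, RSIM + FPGA. Cell layout lost in extraction; values as extracted, in order: -, 1.26, -, 0.08, 4.60, 1.26, 1.16, 4.70, 0.35, 1.26, NA, NA]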
52
Modified Design Framework
53
Modified Design Framework: what it provides
54
Other related topics by the Codesign team at Texas A&M University

Integration of Power simulation capabilities with
SimOS
Optimized VMX Architecture modeling for future
generation processors
Optimized HDLC Core

55
Codesign Intelligent Agent

Comments on this

56
Meeting Custom Design needs

Comments on this

57
Summary
58
Comments on this Framework
59
What this framework lacks
60
Co-simulation
61
Co-simulation (contd)
62
Co-simulation (contd)
63
Modified Design Framework
64
Modified Design Framework (contd)
65
Other related topics by the Codesign team at Texas A&M University