ECE 697F Reconfigurable Computing Lecture 21 Hardware/Software Co-Design: Automatic Compilation to Reconfigurable Coprocessors

About This Presentation

Description:

Hardware/Software Co-Design: Automatic Compilation. to Reconfigurable Coprocessors ... Co-Design Methodology. Lecture 21: Hardware/Software Codesign. November 30, 2006 ... – PowerPoint PPT presentation

Slides: 31

Provided by: RussTe7

Transcript and Presenter's Notes

1
ECE 697F Reconfigurable Computing
Lecture 21: Hardware/Software Co-Design
Automatic Compilation to Reconfigurable Coprocessors
2
Overview

Hardware/software codesign involves partitioning
part of a program for hardware implementation
Motivation
Algorithm specification
Partitioning
Implementation
Cosimulation

3
Definition: Hardware/Software Co-Design
Courtesy Ragan, Sandborn, Stoaks
A design methodology supporting the cooperative
and concurrent development of hardware and
software (co-specification, co-development, and
co-verification) in order to achieve shared
functionality and performance goals for a
combined system [1].
economic goals
[1] Gupta, R. and De Micheli, G.,
"Hardware-Software Cosynthesis for Digital
Systems," IEEE Design & Test of Computers,
September 1993, pp. 29-41.
4
Objective
The goal is to develop a methodology for modeling
hardware and software development, fabrication,
and support costs concurrently with
hardware/software co-design.
5
Motivations

Not possible to put everything in hardware due to
limited resources
Some code more appropriate for sequential
implementation
Desirable to allow for parallelization,
serialization
Possible to modify existing compilers to perform
the task.

6
Methodology

Separation between function and communication
Unified refinable formal specification model
facilitates system specification
implementation independent
eases HW/SW trade-off evaluation and partitioning

7
Co-Design Methodology
8
Computational Model
[Figure labels: memory bus, general-purpose processor]

Most recent work addressing this problem assumes
relatively slow bus interface
FPGA has direct interface to memory in this model
With arrival of new coprocessor architectures,
likely migration to coprocessor interface
Interface to data cache

9
Interfacing: A Key

Replace a portion of existing high-level code
with calls to special-purpose hardware
Extract predictable timing for processors,
busses, and synthesized hardware
Need to represent computation as a flowgraph
including control flow
Need accurate resource evaluators to determine
feature sizes

10
Flowgraph Generation

Behavioral description converted to directed
acyclic graph
Resulting hardware can be synthesized to datapath
and control graph
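
As a rough illustration of the first bullet, the sketch below turns a few straight-line assignments into a directed acyclic dataflow graph. It is a minimal sketch under stated assumptions, not the flow used by any tool in this lecture: the function name `build_flowgraph` is hypothetical, Python source stands in for the behavioral description, and only `+`, `-`, and `*` on names and constants are handled. Control flow, which the slides say must also be represented, is left out.

```python
import ast

def build_flowgraph(source):
    """Build a dataflow DAG from straight-line assignments (illustrative only)."""
    nodes = []    # node id -> label (input name, constant, or operator symbol)
    edges = []    # (producer id, consumer id)
    latest = {}   # variable name -> id of the node holding its current value
    symbols = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

    def new_node(label):
        nodes.append(label)
        return len(nodes) - 1

    def visit(expr):
        if isinstance(expr, ast.Name):
            # Reuse the node that last defined the variable, else create an input node.
            if expr.id not in latest:
                latest[expr.id] = new_node(expr.id)
            return latest[expr.id]
        if isinstance(expr, ast.Constant):
            return new_node(repr(expr.value))
        if isinstance(expr, ast.BinOp) and type(expr.op) in symbols:
            left, right = visit(expr.left), visit(expr.right)
            node = new_node(symbols[type(expr.op)])
            edges.extend([(left, node), (right, node)])
            return node
        raise ValueError("unsupported expression in this sketch")

    for stmt in ast.parse(source).body:
        if not (isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name)):
            raise ValueError("only simple assignments are supported in this sketch")
        latest[stmt.targets[0].id] = visit(stmt.value)
    return nodes, edges

nodes, edges = build_flowgraph("t = a * b\ny = t + a * c")
for src, dst in edges:
    print(f"{nodes[src]}#{src} -> {nodes[dst]}#{dst}")
```

Every operator node is created after its operands, so each edge runs from a lower node id to a higher one, which is what keeps the graph acyclic.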

11
Determining Communication Level
[Figure: communication levels between software and custom application hardware. Labels: send, receive, wait; register reads/writes; I/O driver; interrupt service; bus transactions; I/O bus; interrupts]

Easier to program at application level
(send, receive, wait), but timing is difficult to predict
More difficult to specify at low level
Difficult to extract from program, but timing and
resources are easier to predict

12
Partitioning Costs

In order to perform HW/SW partitioning, consider
cost constraints: hardware and software
First: hardware resources
FPGA contains a fixed number of gate resources,
limited local memory, and limited I/O bandwidth
Difficult to estimate timing if new hardware is
customized each time
Recent design shift towards IP (intellectual
property)
Well-defined resource and timing characteristics

13
Timing Estimation

Key parameters
Execution rate of basic set of operations: simple
to determine for RISC
Memory access time (bus, interrupt, other)
Critical constraints for computation must be
defined
Min/max delay constraints for software on
processor
How often a routine must be invoked for
time-critical operation
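
The parameters above can be combined into a back-of-the-envelope estimate. The sketch below is only an illustration: the function names, the operation mix, the per-operation cycle counts, the memory latency, and the clock rate are all made-up assumptions, not figures from the lecture.

```python
def estimate_cycles(op_counts, cycles_per_op, mem_accesses, mem_latency_cycles):
    """Estimate software cycles as operation cost plus memory access cost."""
    compute = sum(count * cycles_per_op[op] for op, count in op_counts.items())
    return compute + mem_accesses * mem_latency_cycles

def within_delay_bounds(cycles, clock_hz, min_delay_s, max_delay_s):
    """Check an estimated run time against min/max delay constraints."""
    run_time_s = cycles / clock_hz
    return min_delay_s <= run_time_s <= max_delay_s

# Hypothetical routine on a hypothetical 50 MHz RISC processor.
op_counts = {"add": 100, "mul": 20, "load": 40}
cycles_per_op = {"add": 1, "mul": 3, "load": 1}
cycles = estimate_cycles(op_counts, cycles_per_op,
                         mem_accesses=40, mem_latency_cycles=10)
print(cycles)                                             # 600 cycles
print(within_delay_bounds(cycles, 50e6, 0.0, 20e-6))      # 12 us fits in 20 us
```

On a RISC processor the per-operation table is short and nearly constant, which is why the slide calls the execution rate simple to determine; the memory term is the part that depends on the bus or interrupt path.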

14
System Partitioning
[Figure labels: Interface, Partition, Model, FPGA, Capture, Synthesize, Processor]
Good partitioning mechanism
Minimize communication across bus
Allows parallelism -> both hardware (FPGA) and
processor operating concurrently
Near peak processor utilization at all times
(performing useful work)

15
System Partitioning

Possible to mask bus and FPGA delay through
processor context switches
In effect, hardware performs predetermined
functions identified at compile time
Software performs functions determined at run
time (conditionals, branches, etc.)

16
Partitioning Analysis

Result of compilation is synthesizable HDL and
assembly code for the processor
Compiler profiler determines dependence and rough
performance estimates

17
Partitioning Algorithms
[Figure: a task moving between the software list of tasks and the hardware list of tasks]

Assume everything initially in software
Select task for swapping
Migrate to hardware and evaluate cost
Timing, hardware resources, program and data
storage, synchronization overhead
Cost evaluation and move evaluation similar to
what we've seen regarding min-cut and simulated
annealing
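
The loop described above (start in software, pick a task, migrate it, evaluate the cost) can be sketched as a simple greedy pass. This is a deliberately simplified stand-in, not min-cut or simulated annealing and not the algorithm of any system cited in the lecture. The `Task` fields, the task names, and every number are hypothetical, and the cost model assumes tasks run one after another.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    sw_cycles: int    # estimated cycles on the processor
    hw_cycles: int    # estimated cycles on the FPGA
    gates: int        # estimated FPGA gates if moved to hardware
    sync_cycles: int  # synchronization overhead if moved to hardware

def total_cycles(tasks, in_hw):
    """Serial cost model: hardware tasks pay their synchronization overhead."""
    return sum((t.hw_cycles + t.sync_cycles) if t.name in in_hw else t.sw_cycles
               for t in tasks)

def greedy_partition(tasks, gate_budget):
    """Start with everything in software; migrate the best-gain task that fits."""
    in_hw, gates_used = set(), 0
    while True:
        best, best_gain = None, 0
        for t in tasks:
            if t.name in in_hw or gates_used + t.gates > gate_budget:
                continue
            gain = t.sw_cycles - (t.hw_cycles + t.sync_cycles)
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:          # no remaining move improves the cost
            return in_hw
        in_hw.add(best.name)
        gates_used += best.gates

tasks = [
    Task("filter", sw_cycles=1000, hw_cycles=100, gates=5000, sync_cycles=50),
    Task("parse",  sw_cycles=400,  hw_cycles=380, gates=1000, sync_cycles=60),
    Task("fft",    sw_cycles=2000, hw_cycles=200, gates=9000, sync_cycles=100),
]
chosen = greedy_partition(tasks, gate_budget=12000)
print(chosen, total_cycles(tasks, chosen))
```

A greedy pass only ever accepts improving moves, so it can get stuck where simulated annealing, which occasionally accepts a worse move, would not. In this example `parse` stays in software because its synchronization overhead outweighs the hardware gain.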

18
Interface Models

Synchronization through a FIFO
FIFO can be implemented either in hardware or in
software
Effectively reconfigure hardware (FPGA) to
allocate buffer space as needed
Interrupts used for software version of FIFO

[Figure: FPGA connected through a control/data FIFO; node labels p1, p2, p3, r2, r3, d1, d2, d3]
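
The software version of the FIFO can be pictured with a bounded queue between two threads. This is only a sketch under assumptions: a Python thread stands in for the FPGA side, blocking queue operations stand in for the interrupts mentioned above, and the function name, the `depth` parameter, and the squaring "computation" are placeholders.

```python
import queue
import threading

def run_fifo_demo(items, depth=2):
    """Pass items through a bounded FIFO to a consumer thread."""
    fifo = queue.Queue(maxsize=depth)   # fixed buffer space, like an allocated FIFO
    results = []

    def hardware_side():
        # Stand-in for the FPGA: take one item at a time until the end marker.
        while True:
            item = fifo.get()           # blocks while the FIFO is empty
            if item is None:
                break
            results.append(item * item)

    worker = threading.Thread(target=hardware_side)
    worker.start()
    for item in items:
        fifo.put(item)                  # blocks while the FIFO is full
    fifo.put(None)                      # end-of-stream marker
    worker.join()
    return results

print(run_fifo_demo([1, 2, 3, 4]))
```

The producer and consumer never share state directly; the only synchronization is the FIFO filling up or running empty, which is what makes the buffer depth a tunable resource.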
19
Shared Memory Interface

Processor and FPGA interact through shared memory
Allows for simpler synchronization through
semaphores
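
A minimal sketch of semaphore-based synchronization over shared memory follows. It assumes a single shared location and uses a Python thread as a stand-in for the FPGA; the names `shared_memory_demo`, `data_ready`, and `slot_free`, and the add-one "computation", are illustrative only.

```python
import threading

def shared_memory_demo(values):
    """Hand values to a coprocessor stand-in through one shared location."""
    shared = {"data": None}                # stand-in for a shared memory word
    data_ready = threading.Semaphore(0)    # signaled when the processor has written
    slot_free = threading.Semaphore(1)     # signaled when the coprocessor has read
    outputs = []

    def coprocessor():
        for _ in values:
            data_ready.acquire()           # wait for new data
            outputs.append(shared["data"] + 1)
            slot_free.release()            # allow the next write

    worker = threading.Thread(target=coprocessor)
    worker.start()
    for value in values:
        slot_free.acquire()                # wait until the previous value was consumed
        shared["data"] = value
        data_ready.release()
    worker.join()
    return outputs

print(shared_memory_demo([10, 20, 30]))
```

Two semaphores are enough here because there is exactly one writer and one reader; neither side touches the shared location without holding the matching token.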

20
Codesign Verification

Run software on native processor
Run simulation of hardware using Verilog on
another processor

[Figure: cosimulation setup. Labels: software process 1, software process 2, Unix sockets, Verilog PLI, Verilog simulator, bus interface, application-specific hardware, hardware process 1 (shown twice)]
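
The socket link between the software side and the simulator can be sketched as a request/response exchange. Everything here is an assumption made for illustration: `simulator_stub` is a hypothetical stand-in that adds two integers where a real setup would hand the request to a Verilog simulator through the PLI, the 8-byte request format is invented, and a connected socket pair in one process replaces separate processes on separate machines.

```python
import socket
import struct
import threading

def recv_exact(conn, size):
    """Read exactly size bytes, or return b'' if the peer closed the socket."""
    data = b""
    while len(data) < size:
        chunk = conn.recv(size - len(data))
        if not chunk:
            return b""
        data += chunk
    return data

def simulator_stub(conn):
    # Hypothetical stand-in for the hardware simulator: replies with a + b.
    with conn:
        while True:
            request = recv_exact(conn, 8)
            if not request:
                break
            a, b = struct.unpack("!ii", request)
            conn.sendall(struct.pack("!i", a + b))

def cosimulate(pairs):
    """Software side: send operands over a socket and collect the replies."""
    sw_end, hw_end = socket.socketpair()
    sim = threading.Thread(target=simulator_stub, args=(hw_end,))
    sim.start()
    results = []
    with sw_end:
        for a, b in pairs:
            sw_end.sendall(struct.pack("!ii", a, b))
            (total,) = struct.unpack("!i", recv_exact(sw_end, 4))
            results.append(total)
    sim.join()                      # the stub exits once the software end closes
    return results

print(cosimulate([(1, 2), (3, 4)]))
```

Keeping the exchange as fixed-size messages makes each software call look like one bus transaction to the simulated hardware.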
21
Results Part I
Clock cycles used
Benchmark   SW          HW-SW       tc ()   Speedup
Diesel      22,403      16,394      9.9     1.4
Smooth      1,781,712   1,393,525   49.6    1.3
3D          1,377       1,514       13.8    0.9

Benchmarks: controller (diesel engine), image
processing (smooth, 3D)
Cosyma: Ernst et al., Braunschweig
Model uses bus interface shown previously
Shared memory model
All synchronization through semaphores
Creates flow graph from C program. Synopsys used
to generate hardware
Targeted to embedded systems

22
Results Part II
Candidate cost and program speedup
Benchmark         filter1  filter2  hamming  sed    grep   egrep  gzip    gs
Speedup (A1)      1.08     4.11     14.4     1.0    1.0    1.0    1.0     1.0
Speedup (A2)      1.17     5.01     14.4     1.2    1.1    1.5    2.9     1.0
Gate count (A1)   5120     5920     12416    0      0      0      0       0
Gate count (A2)   5120     5920     12416    1120   5504   5984   11712   960

Jantsch et al., Royal Institute of Technology,
Sweden
A1 -> 500 ns main memory, 100 ns local memory
A2 -> 20 ns local and main memory access
FIFO model of transfer implemented in software
GNU C front end to extract data graphs,
information
4005 Xilinx part as computation element

23
Co-Design Approach
24
Software Cost Analysis Process
25
Hardware Cost Analysis Process
26
Foresight Co-Design
[Figure: Foresight co-design flow. Labels, in the order they appear: User-defined Reusables; State Machines; Mini-specs; Library Elements; System Requirements Capture; Functional Behavior Block Diagram; Integrated Toolset; Data Flow Monitors; Resource Specification; Architecture Block Diagram; System Characteristics Derived from Foresight; Gate Count; Lines of Code; Cost Analysis (Ghost); HW; SW; I/O Count; Number Up; Dev. Cost; Dev. Schedule; Fab. Cost; Die Size; Maintenance Cost; Test Cost; SCP Cost; Co-Design Process Outputs; System Performance Metrics; System Cost]
27
Example Hardware and Software Foundries

SW1
Nominal to High development effort
SW2
Low to Nominal development effort

HW1
LSI Logic ASIC Wafer Foundry Data
0.18 µm feature size
8 inch wafers
6 layers
TSMC 018 Wafer Processing
HW2
Samsung Semiconductor ASIC Wafer Foundry Data
0.35 µm feature size
6 inch wafers
4 layers
TSMC 035 Wafer Processing

28
MIXED Implementation Using HW1 and SW1
[Figure: chart of percent of total cost (0 to 100) by category (design, tooling, fabrication, packaging, testing, software development) versus production quantity and level of reuse of gate-level IP and code. Quantities: 1000, 10000, 100000; reuse levels: No, 20, 40. Additional label: Recurring]
29
Total Cost Per Chip
[Figure: total cost per chip (0 to 45) versus percent custom hardware (0 to 100) at 10,000 units, with series HW1/SW1, HW1/SW2, HW2/SW1, HW2/SW2]
30
Summary

Hardware/software codesign complicated and
limited by performance estimates
Algorithms not generally as good as human
partitioning
Other interesting issues include dual processors,
special memory interfaces
Will likely evolve at faster rate as compilers
evolve
