ECE 697F Reconfigurable Computing, Lecture 21: Hardware/Software Co-Design: Automatic Compilation to Reconfigurable Coprocessors

Description:

Hardware/Software Co-Design: Automatic Compilation to Reconfigurable Coprocessors ... Co-Design Methodology. Lecture 21: Hardware/Software Codesign. November 30, 2006 ...

Slides: 31
Provided by: RussTe7

Transcript and Presenter's Notes

1
ECE 697F Reconfigurable Computing
Lecture 21: Hardware/Software Co-Design: Automatic
Compilation to Reconfigurable Coprocessors
2
Overview
  • Hardware/software codesign involves partitioning
    part of a program for hardware implementation
  • Motivation
  • Algorithm specification
  • Partitioning
  • Implementation
  • Cosimulation

3
Definition: Hardware/Software Co-Design
Courtesy: Ragan, Sandborn, Stoaks
A design methodology supporting the cooperative
and concurrent development of hardware and
software (co-specification, co-development, and
co-verification) in order to achieve shared
functionality and performance goals for a
combined system [1].
Slide annotation: economic goals
[1] Gupta, R. and De Micheli, G.,
"Hardware-Software Cosynthesis for Digital
Systems," IEEE Design & Test of Computers,
September 1993, pp. 29-41.
4
Objective
The goal is to develop a methodology for modeling
hardware and software development, fabrication,
and support costs concurrently with
hardware/software co-design.
5
Motivations
  • Not possible to put everything in hardware due to
    limited resources
  • Some code more appropriate for sequential
    implementation
  • Desirable to allow for parallelization,
    serialization
  • Possible to modify existing compilers to perform
    the task.

6
Methodology
  • Separation between function and communication
  • Unified, refinable formal specification model that:
  • facilitates system specification
  • is implementation independent
  • eases HW/SW trade-off evaluation and partitioning

7
Co-Design Methodology
8
Computational Model
[Figure labels: Memory bus; General-Purpose Processor]
  • Most recent work addressing this problem assumes
    relatively slow bus interface
  • FPGA has direct interface to memory in this model
  • With arrival of new coprocessor architectures,
    likely migration to coprocessor interface
  • Interface to data cache

9
Interfacing A Key
  • Replace a portion of existing high-level code
    with calls to special-purpose hardware
  • Extract predictable timing for processors,
    busses, and synthesized hardware
  • Need to represent computation as a flowgraph
    including control flow
  • Need accurate resource evaluators to determine
    feature sizes

10
Flowgraph Generation
  • Behavioral description converted to directed
    acyclic graph
  • Resulting hardware can be synthesized to datapath
    and control graph
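
As an illustration of the first bullet, the sketch below converts a single arithmetic expression into a directed acyclic graph using Python's standard `ast` module. It is not the lecture's compiler flow: the function name `expression_to_dag` and the choice to merge identical subexpressions into one node are assumptions made for the example, and control flow is not handled.

```python
import ast

def expression_to_dag(source):
    """Build a DAG for one arithmetic expression, sharing common subexpressions."""
    nodes = {}   # structural key -> node id
    edges = []   # (operand node id, consumer node id)

    def visit(node):
        if isinstance(node, ast.Name):
            key, operands = ("var", node.id), []
        elif isinstance(node, ast.Constant):
            key, operands = ("const", node.value), []
        elif isinstance(node, ast.BinOp):
            operands = [visit(node.left), visit(node.right)]
            key = (type(node.op).__name__, *operands)
        else:
            raise ValueError("unsupported construct: " + type(node).__name__)
        if key not in nodes:
            # New node: operands were numbered first, so edges always point
            # from a lower id to a higher id and the graph cannot contain a cycle.
            nodes[key] = len(nodes)
            edges.extend((src, nodes[key]) for src in operands)
        return nodes[key]

    root = visit(ast.parse(source, mode="eval").body)
    return nodes, edges, root

nodes, edges, root = expression_to_dag("(a + b) * (a + b) + c")
for key, node_id in nodes.items():
    print(node_id, key)
```

In this example the repeated `a + b` becomes a single adder node feeding both inputs of the multiplier, which is the kind of sharing a datapath synthesis step could exploit.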

11
Determining Communication Level
[Figure labels: Send, Receive, Wait; Application hardware (custom); Register reads/writes; I/O driver; Interrupt service; Bus transactions; I/O bus; Interrupts]
  • Easier to program at application level
    (send, receive, wait) but difficult to predict
  • More difficult to specify at low level
  • Low level is difficult to extract from program, but
    timing and resources are easier to predict

12
Partitioning Costs
  • In order to perform HW/SW partitioning, consider
    hardware and software cost constraints
  • First: hardware resources
  • FPGA contains a fixed number of gate resources,
    limited local memory, and limited I/O bandwidth
  • Difficult to estimate timing if new hardware is
    customized each time
  • Recent design shift towards IP (intellectual
    property)
  • IP has well-defined resource and timing
    characteristics

13
Timing Estimation
  • Key parameters
  • Execution rate of basic set of operations: simple
    to determine for RISC
  • Memory access time (bus, interrupt, other)
  • Critical constraints for computation must be
    defined
  • Min/Max delay constraints for software on
    processor
  • How often must a routine be invoked for time
    critical operation
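
The parameters above can be combined into a rough software timing estimate. The sketch below is a minimal model, assuming a fixed cycle cost per operation type and per memory access; the function names and every number in the example are hypothetical and are not taken from the lecture.

```python
def estimate_cycles(op_counts, cycles_per_op, memory_accesses, memory_access_cycles):
    """Estimate cycles as operation cost plus memory access cost."""
    compute = sum(count * cycles_per_op[op] for op, count in op_counts.items())
    return compute + memory_accesses * memory_access_cycles

def meets_constraints(cycles, clock_hz, min_delay_s, max_delay_s):
    """Check the estimated execution time against min/max delay constraints."""
    elapsed = cycles / clock_hz
    return min_delay_s <= elapsed <= max_delay_s

# Hypothetical routine: 100 adds (1 cycle each), 20 multiplies (3 cycles each),
# and 40 memory accesses at 10 cycles each, on a 50 MHz processor.
cycles = estimate_cycles({"add": 100, "mul": 20}, {"add": 1, "mul": 3}, 40, 10)
print(cycles, meets_constraints(cycles, 50_000_000, 0.0, 20e-6))
```

A model this simple ignores caches and pipeline stalls, which is one reason the slide singles out RISC processors as the easy case.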

14
System Partitioning
[Figure labels: Interface; Partition; Model; FPGA; Capture; Synthesize; Processor]
  • Good partitioning mechanism
  • Minimize communication across bus
  • Allows parallelism -> both hardware (FPGA) and
    processor operating concurrently
  • Near peak processor utilization at all times
    (performing useful work)

15
System Partitioning
  • Possible to mask bus and FPGA delay through
    processor context switches
  • In effect, hardware performs predetermined
    functions identified at compile time
  • Software performs functions determined at run
    time (conditionals, branches, etc.)

16
Partitioning Analysis
  • Result of compilation is synthesizable HDL and
    assembly code for the processor
  • Compiler profiler determines dependence and rough
    performance estimates

17
Partitioning Algorithms
[Figure labels: Software (list of tasks); Hardware (list of tasks); task]
  • Assume everything initially in software
  • Select task for swapping
  • Migrate to hardware and evaluate cost
  • Timing, hardware resources, program and data
    storage, synchronization overhead
  • Cost evaluation and move evaluation similar to
    what we've seen regarding min-cut and simulated
    annealing
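
The loop described above can be sketched as a greedy iterative improvement: start with every task in software, try moving each one to hardware, and keep the move that lowers total time the most while staying within the gate budget. This is a simplified stand-in, not the lecture's algorithm and not simulated annealing (it never accepts a worsening move). The task names, the cost fields (`sw_time`, `hw_time`, `sync`, `gates`), and all numbers are hypothetical.

```python
def total_time(tasks, in_hw):
    """Total time: hardware tasks pay hw_time plus synchronization overhead."""
    return sum(
        (t["hw_time"] + t["sync"]) if name in in_hw else t["sw_time"]
        for name, t in tasks.items()
    )

def gates_used(tasks, in_hw):
    return sum(tasks[name]["gates"] for name in in_hw)

def partition(tasks, gate_budget):
    in_hw = set()                       # assume everything initially in software
    while True:
        best_time = total_time(tasks, in_hw)
        best_move = None
        for name in sorted(tasks):      # select a candidate task
            if name in in_hw:
                continue
            trial = in_hw | {name}      # tentatively migrate it to hardware
            if gates_used(tasks, trial) > gate_budget:
                continue                # violates the hardware resource limit
            trial_time = total_time(tasks, trial)
            if trial_time < best_time:
                best_time, best_move = trial_time, name
        if best_move is None:
            return in_hw                # no remaining move improves the cost
        in_hw.add(best_move)

tasks = {
    "filter": {"sw_time": 100, "hw_time": 10, "sync": 5, "gates": 5000},
    "fft":    {"sw_time": 200, "hw_time": 40, "sync": 10, "gates": 9000},
    "parse":  {"sw_time": 50,  "hw_time": 45, "sync": 20, "gates": 1000},
}
chosen = partition(tasks, gate_budget=12000)
print(sorted(chosen), total_time(tasks, chosen))
```

With these numbers, `parse` stays in software because its synchronization overhead outweighs the hardware gain, and `filter` is excluded only by the gate budget.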

18
Interface Models
  • Synchronization through a FIFO
  • FIFO can be implemented either in hardware or in
    software
  • Effectively reconfigure hardware (FPGA) to
    allocate buffer space as needed
  • Interrupts used for software version of FIFO
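
A software FIFO of this kind can be sketched with a bounded queue between two threads. This only models the synchronization behavior: the thread `hardware_side` is a stand-in for the FPGA, the doubling operation is a placeholder computation, and blocking `get`/`put` calls take the place of interrupts. All names are assumptions for the example.

```python
import queue
import threading

def run_fifo_demo(samples, depth=4):
    fifo = queue.Queue(maxsize=depth)   # bounded FIFO between the two sides
    results = []
    done = object()                     # end-of-stream marker

    def hardware_side():
        # Stand-in for the FPGA: consume items in order until the marker arrives.
        while True:
            item = fifo.get()           # blocks while the FIFO is empty
            if item is done:
                break
            results.append(item * 2)    # placeholder computation

    worker = threading.Thread(target=hardware_side)
    worker.start()
    for sample in samples:
        fifo.put(sample)                # blocks while the FIFO is full
    fifo.put(done)
    worker.join()
    return results

print(run_fifo_demo([1, 2, 3, 4, 5, 6], depth=2))
```

The `depth` parameter plays the role of the buffer space mentioned above: a deeper FIFO lets the producer run further ahead before it has to wait.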

[Figure labels: p1, p2, p3; r2, r3; d1, d2, d3; FPGA; Control/Data FIFO]
19
Shared Memory Interface
  • Processor and FPGA interact through shared memory
  • Allows for simpler synchronization through
    semaphores
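
The semaphore handshake can be sketched with two threads sharing one memory location. This is a software-only illustration: the dictionary `shared` stands in for the shared memory region, the thread `fpga_side` stands in for the FPGA, and squaring is a placeholder computation. All names are assumptions for the example.

```python
import threading

def run_shared_memory_demo(values):
    shared = {"data": None}                 # stand-in for the shared memory region
    data_ready = threading.Semaphore(0)     # signalled by the processor side
    result_ready = threading.Semaphore(0)   # signalled by the FPGA side
    results = []

    def fpga_side():
        for _ in values:
            data_ready.acquire()                    # wait for an operand
            shared["data"] = shared["data"] ** 2    # placeholder computation
            result_ready.release()                  # signal that the result is in place

    worker = threading.Thread(target=fpga_side)
    worker.start()
    for value in values:
        shared["data"] = value          # processor writes the operand
        data_ready.release()
        result_ready.acquire()          # wait for the result before reading it back
        results.append(shared["data"])
    worker.join()
    return results

print(run_shared_memory_demo([1, 2, 3]))
```

Each side touches the shared location only while it holds the turn granted by a semaphore, so no additional locking is needed in this two-party case.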

20
Codesign Verification
  • Run software on native processor
  • Run simulation of hardware using Verilog on
    another processor
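
The software-to-simulator link can be illustrated with a socket pair. The sketch below does not use Verilog or the PLI: `hardware_model` is a Python stub standing in for the simulated hardware, and the line-based `W`/`R` message format is invented for the example. Only the idea of software exchanging bus transactions with a hardware model over sockets comes from the slide.

```python
import socket
import threading

def hardware_model(conn):
    """Stub standing in for the simulated hardware: a tiny register file."""
    registers = {}
    with conn, conn.makefile("rw") as stream:
        for line in stream:                 # one bus transaction per line
            op, addr, *value = line.split()
            if op == "W":                   # write: store the value, acknowledge
                registers[addr] = int(value[0])
                stream.write("OK\n")
            else:                           # read: return the stored value
                stream.write(f"{registers.get(addr, 0)}\n")
            stream.flush()

def run_cosim_demo():
    sw_end, hw_end = socket.socketpair()    # connected pair of sockets
    sim = threading.Thread(target=hardware_model, args=(hw_end,))
    sim.start()
    with sw_end, sw_end.makefile("rw") as bus:
        bus.write("W 0x10 42\n")            # software issues a bus write
        bus.flush()
        ack = bus.readline().strip()
        bus.write("R 0x10\n")               # then reads the register back
        bus.flush()
        value = int(bus.readline())
    sim.join()                              # stub exits when the socket closes
    return ack, value

print(run_cosim_demo())
```

In an actual setup of the kind shown on the slide, the stub would be replaced by a Verilog simulator whose PLI routines service the socket.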

[Figure labels: Verilog Simulator; Application-specific hardware; Hardware Process 1 (shown twice); Software process 1; Software process 2; Bus interface; Unix sockets; Verilog PLI]
21
Results Part I
Clock cycles used
Benchmark   SW          HW-SW       tc ()   Speedup
Diesel      22,403      16,394      9.9     1.4
Smooth      1,781,712   1,393,525   49.6    1.3
3D          1,377       1,514       13.8    0.9
  • Benchmarks: diesel engine controller, image
    processing (smooth, 3D)
  • Cosyma: Ernst et al., Braunschweig
  • Model uses bus interface shown previously
  • Shared memory model
  • All synchronization through semaphores
  • Creates flow graph from C program. Synopsys used
    to generate hardware
  • Targeted to embedded systems
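
The speedup column in the results table is consistent with the ratio of software-only cycles to hardware/software cycles, rounded to one decimal place. The short check below recomputes it from the cycle counts in the table; the variable and function names are chosen for the example.

```python
# (SW cycles, HW-SW cycles) for each benchmark, copied from the results table.
cycles = {
    "Diesel": (22403, 16394),
    "Smooth": (1781712, 1393525),
    "3D": (1377, 1514),
}

def speedup(sw_cycles, hw_sw_cycles):
    """Speedup of the partitioned system over the software-only version."""
    return sw_cycles / hw_sw_cycles

speedups = {name: round(speedup(sw, hw_sw), 1) for name, (sw, hw_sw) in cycles.items()}
print(speedups)   # {'Diesel': 1.4, 'Smooth': 1.3, '3D': 0.9}
```

A value below 1.0, as for the 3D benchmark, means the partitioned version used more cycles than the software-only version.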

22
Results Part II
Candidate cost and program speedup
Benchmark        filter1  filter2  hamming  sed   grep  egrep  gzip   gs
Speedup (A1)     1.08     4.11     14.4     1.0   1.0   1.0    1.0    1.0
Speedup (A2)     1.17     5.01     14.4     1.2   1.1   1.5    2.9    1.0
Gate count (A1)  5120     5920     12416    0     0     0      0      0
Gate count (A2)  5120     5920     12416    1120  5504  5984   11712  960
  • Jantsch et al., Royal Institute of Technology,
    Sweden
  • A1 -> 500 ns main memory, 100 ns local memory
  • A2 -> 20 ns local and main memory access
  • FIFO model of transfer implemented in software
  • GNU C front end to extract data graphs,
    information
  • Xilinx 4005 part as computation element

23
Co-Design Approach
24
Software Cost Analysis Process
25
Hardware Cost Analysis Process
26
Foresight Co-Design
[Figure labels: User-defined Reusables; State Machines; Mini-specs; Library Elements; System Requirements Capture; Functional Behavior Block Diagram; Integrated Toolset; Data Flow Monitors; Resource Specification; Architecture Block Diagram; System Characteristics Derived from Foresight; Gate Count; Lines of Code; Cost Analysis (Ghost); HW; SW; I/O Count; Number Up; Dev. Cost; Dev. Schedule; Fab. Cost; Die Size; Maintenance Cost; Test Cost; SCP Cost; Co-Design Process Outputs; System Performance Metrics; System Cost]
27
Example Hardware and Software Foundries
  • SW1: nominal to high development effort
  • SW2: low to nominal development effort
  • HW1: LSI Logic ASIC wafer foundry data;
    0.18 µm feature size, 8 inch wafers, 6 layers,
    TSMC 018 wafer processing
  • HW2: Samsung Semiconductor ASIC wafer foundry data;
    0.35 µm feature size, 6 inch wafers, 4 layers,
    TSMC 035 wafer processing

28
MIXED Implementation Using HW1 and SW1
[Figure: Percent of Total Cost (0 to 100) versus Production Quantity and Level of Reuse. Cost components: software development, testing, packaging, fabrication, tooling, design. Categories: 1000, 10000, and 100000 units, each with reuse level No, 20, or 40. Other labels: Recurring; Reuse of gate-level IP and code.]
29
Total Cost Per Chip
[Figure: Total Cost (/chip), 0 to 45, versus Percent Custom Hardware, 0 to 100, for 10,000 units. Series: HW1/SW1, HW1/SW2, HW2/SW1, HW2/SW2.]
30
Summary
  • Hardware/software codesign complicated and
    limited by performance estimates
  • Algorithms not generally as good as human
    partitioning
  • Other interesting issues include dual processors,
    special memory interfaces
  • Will likely evolve at faster rate as compilers
    evolve