Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs


1
Improving Embedded System Software Speed and
Energy using Microprocessor/FPGA Platform ICs

Frank Vahid
Associate Professor
Dept. of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer
Systems at UC Irvine
http://www.cs.ucr.edu/~vahid
This research has been supported by the National
Science Foundation, NEC, Trimedia, and Triscend

2
General Purpose vs. Special Purpose

Standard tradeoff

3
General Purpose vs. Single Purpose Processors
total = 0
for i = 1 to N loop
  total += M[i]
end loop

Designers have long known that
General-purpose processors are flexible
Single-purpose processors are fast

General purpose: flexibility, design cost, time-to-market
OR
Single purpose: performance, power efficiency, size
4
Mixing General and Single Purpose Processors

A.k.a. Hardware/software partitioning
Hardware: single-purpose processors
Coprocessor, accelerator, peripheral, etc.
Software: general-purpose processors
Though hardware underneath!
Especially important for embedded systems
Computers embedded in devices (cameras, cars, toys, even people)
Speed, cost, time-to-market, power, and size demands are tough

5
How is Partitioning Done for Embedded Systems?

Partitioning into hw and sw blocks done early
During conceptual stage
Sw design done separately from hw design
Attempts since late 1980s to automate not yet
successful
Partitioning manually is reasonably
straightforward
Spec is informal and not machine readable
Sw algorithms may differ from hw algorithms
No compelling need for tools

[Figure: system partitioning produces a Sw spec and a Hw spec; Sw design targets the processor, Hw design targets an ASIC]
6
New Platforms Invite New Efforts in Hw/Sw
Partitioning

New single-chip platforms contain both
general-purpose processor and an FPGA
FPGA: field-programmable gate array
Programmable just like software → flexible
Intended largely to implement single-purpose
processors
Can we perform a later partitioning to improve
the software too?

[Figure: system partitioning flow (Sw spec, Hw spec, Sw design, Hw design) targeting a Processor + FPGA platform and an ASIC]
7
Commercial Single-Chip Microprocessor/FPGA
Platforms

Triscend E5 based on 8-bit 8051 CISC core (2000)
10 Dhrystone MIPS at 40MHz
up to 40K logic gates
Cost only about $4

8
Single-Chip Microprocessor/FPGA Platforms

Atmel FPSLIC
Field-Programmable System-Level IC
Based on AVR 8-bit RISC core
20 Dhrystone MIPS
5k-40k logic gates
$5-10

Courtesy of Atmel
9
Single-Chip Microprocessor/FPGA Platforms

Triscend A7 chip (2001)
Based on ARM7 32-bit RISC processor
54 Dhrystone MIPS at 60 MHz
Up to 40k logic gates
$10-20 in volume

Courtesy of Triscend
10
Single-Chip Microprocessor/FPGA Platforms

Altera's Excalibur EPXA 10 (2002)
ARM (922T) hard core
200 Dhrystone MIPS at 200 MHz
200k to 2 million logic gates

Source: www.altera.com
11
Single-Chip Microprocessor/FPGA Platforms

Xilinx Virtex II Pro (2002)
PowerPC based
420 Dhrystone MIPS at 300 MHz
1 to 4 PowerPCs
4 to 16 gigabit transceivers
12 to 216 multipliers
Millions of logic gates
200k to 4M bits RAM
204 to 852 I/O
$100-500 (>25,000 units)

[Figure labels: up to 16 serial transceivers (622 Mbps to 3.125 Gbps), PowerPCs, config. logic. Courtesy of Xilinx]
12
Single-Chip Microprocessor/FPGA Platforms

Why wouldn't future microprocessor chips include
some amount of on-chip FPGA?

One argument against: area
Lots of silicon area taken up by FPGA
FPGAs are about 20-30 times less area-efficient than
custom logic
FPGAs used to be for prototyping, too big for
final products
But chip trends imply that FPGAs will be O.K. in
final products

13
How Much is Enough?
Perhaps a bit small
14
How Much is Enough?
Reasonably sized
15
How Much is Enough?
Probably plenty big for most of us
16
How Much is Enough?
More than typically necessary
17
How Much Custom Logic is Enough?
1993: 1 million logic transistors
Perhaps a bit small
8-bit processor: 50,000 tr.; Pentium: 3 million tr.; MPEG decoder: several million tr.
18
How Much Custom Logic is Enough?
1996: 5-8 million logic transistors
Reasonably sized
19
How Much Custom Logic is Enough?
1999: 10-50 million logic transistors
Probably plenty big for most of us
20
How Much Custom Logic is Enough?
2002: 100-200 million logic transistors
More than typically necessary
21
How Much Custom Logic is Enough?
1993: 1 M
2008: >1 BILLION logic transistors
Perhaps very few people could design this
22
Very Few Companies Can Design High-End ICs
Design productivity gap
Moore's Law
Source: ITRS'99

Designer productivity growing at slower rate
1981: 100 designer months → $1M
2002: 30,000 designer months → $300M

23
Single-Chip Platforms with On-Chip FPGAs

So, big FPGAs on-chip are O.K., because
mainstream designers couldn't have used all that
silicon area anyway

But couldn't designers use custom logic instead
of FPGAs to make smaller chips and save costs?

24
Shrinking Chips

Yes, but there's a limit
Chips becoming pin limited

Pads connecting to external pins
25
Trend Towards Pre-Fabricated Platforms: ASSPs

ASSP: application-specific standard product
Domain-specific pre-fabricated IC
e.g., digital camera IC
ASIC: application-specific IC
ASSP revenue > ASIC revenue
ASSP design starts > ASIC design starts
A design start is a unique IC design
Ignores quantity of the same IC
ASIC design starts decreasing
Due to strong benefits of using pre-fabricated devices

Source: Gartner/Dataquest, September '01
26
Microprocessor/FPGA Platforms

Trends point towards such platforms increasing in
popularity
Can we automatically partition the software to
utilize the FPGA?
For improved speed and energy

27
Automatic Hardware/Software Partitioning

Since the late 1980s, the goal has been "spec in, hw/sw out"
But no successful commercial tool yet. Why?

// From MediaBench's JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
  DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  DCTELEM tmp10, tmp11, tmp12, tmp13;
  DCTELEM z1, z2, z3, z4, z5, z11, z13;
  DCTELEM *dataptr;
  int ctr;
  SHIFT_TEMPS

  /* Pass 1: process rows. */
  dataptr = data;
  for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
    tmp0 = dataptr[0] + dataptr[7];
    tmp7 = dataptr[0] - dataptr[7];
    tmp1 = dataptr[1] + dataptr[6];
    // ... Thousands of lines like this in dozens of files
28
Why No Successful Tool Yet?

Most research has focused on extensive
exploration
Roots in VLSI CAD
Decompose problem into fine-grained operations
Apply sophisticated partitioning algorithms
Examples
Min-cut, dynamic programming, simulated
annealing, tabu-search, genetic evolution, etc.
Is this overkill?

1000s of nodes (like circuit partitioning)
Partitioner
29
We Really Only Need Consider a Few Loops Due to
the 90-10 Rule

Recent appearance of embedded benchmark suites
Enables analysis → understanding of the real problem
We've examined UCLA's MediaBench, NetBench, and Motorola's PowerStone
Currently examining EEMBC (embedded equivalent of
SPEC)
UCR loop analysis tools based on SimpleScalar and
Simics

(Same jpeg_fdct_ifast excerpt from MediaBench's JPEG codec as on slide 27)
Assigned each loop a number, sorted by fraction
of contribution to total execution time
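
The ranking step described on this slide can be sketched in C as follows. This is a minimal illustration and not the UCR loop analysis tools: the `LoopProfile` type, the function names, and the cycle counts in the note below are all hypothetical, and real counts would come from a simulator-based profiler.

```c
#include <stdlib.h>

/* Hypothetical profile record: a loop number and the cycles spent in it. */
typedef struct {
    int id;
    unsigned long cycles;
} LoopProfile;

/* qsort comparator: loops with more cycles sort first. */
static int by_cycles_desc(const void *a, const void *b) {
    const LoopProfile *x = (const LoopProfile *)a;
    const LoopProfile *y = (const LoopProfile *)b;
    if (x->cycles < y->cycles) return 1;
    if (x->cycles > y->cycles) return -1;
    return 0;
}

/* Sort loops by contribution to execution time, largest first. */
void rank_loops(LoopProfile *loops, size_t n) {
    qsort(loops, n, sizeof loops[0], by_cycles_desc);
}

/* Fraction of total execution time spent in the k top-ranked loops.
   Assumes loops[] is already ranked and total_cycles > 0. */
double top_k_fraction(const LoopProfile *loops, size_t n, size_t k,
                      unsigned long total_cycles) {
    unsigned long covered = 0;
    for (size_t i = 0; i < k && i < n; i++)
        covered += loops[i].cycles;
    return (double)covered / (double)total_cycles;
}
```

With made-up counts of 500, 250, 100, and 50 cycles out of 1000 total, the top-ranked loop covers 0.5 of execution time and the top two cover 0.75.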

30
The 90-10 Rule Holds for Embedded Systems
In fact, the most frequent loop alone took 50% of time, using 1% of code
31
So Need We Only Consider the First Few Loops? Not
Necessarily

What if programs were self-similar w.r.t. the 90-10 rule?
Remove the most frequent loop: does the 90-10 rule still hold?
Intuition might say yes: remove a loop, and we have another program.

So we need only speed up the first few loops
After that, speedups are limited
Good from tool perspective!
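
One way to test the self-similarity question is to remove the top-ranked loops and ask what share the next loop holds of the time that is left. The C sketch below does that; the function name and any share values fed to it are hypothetical, and it is only an illustration of the check, not the analysis used in the talk.

```c
#include <stddef.h>

/* share[] holds each loop's fraction of total execution time, ranked
   largest first. Returns the share that loop `removed` has of the time
   left once the `removed` loops ranked ahead of it are taken out.
   Returns 0.0 if the index is out of range or no time is left. */
double share_after_removal(const double *share, size_t n, size_t removed) {
    if (removed >= n) return 0.0;
    double remaining = 1.0;
    for (size_t i = 0; i < removed; i++)
        remaining -= share[i];
    if (remaining <= 0.0) return 0.0;
    return share[removed] / remaining;
}
```

If a program were self-similar, this value would stay roughly constant as `removed` grows. With made-up shares of 0.5, 0.2, and 0.1 it falls from 0.5 to 0.4 to about 0.33, meaning each further loop matters less.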

32
Manually Partitioned Several PowerStone
Benchmarks onto Triscend A7 and E5 Chips
E5 IC

Used multimeter and timer to measure performance
and power
Obtained good speedups and energy savings by
partitioning software among microprocessor and
on-chip FPGA

Triscend A7 development board
33
Simulation-Based Results for More Benchmarks
(Quicker than physical implementation, results
matched reasonably well)
34
Looking at Multiple Loops per Benchmark

Manually created several partitioned versions of
each benchmark
Most speedup gained with first 20,000 gates
Surprisingly few gates!

Stitt, Grattan and Vahid, Field-programmable
Custom Computing Machines (FCCM) 2002
Stitt and Vahid, IEEE Design and Test, Dec. 2002
J. Villarreal, D. Suresh, G. Stitt, F. Vahid and
W. Najjar, Design Automation of Embedded Systems,
2002 (to appear).

35
Ideal Speedups for Different Architectures

Varied loop speedup ratio (sw time / hw time of
loop itself) to see impact of faster
microprocessor or slower FPGA: 30, 20, 10 (base
case), 5, and 2
Loop speedups of 5 or more work fine for first
few loops, not hard to achieve
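
The effect of the loop speedup ratio on whole-application speedup can be estimated with Amdahl's law. The C sketch below is an idealized model under stated assumptions: it ignores communication and setup overhead, the function names are hypothetical, and the 0.5 loop fraction in the note is borrowed from the "most frequent loop took 50% of time" observation purely as an example.

```c
#include <assert.h>
#include <stddef.h>

/* Overall application speedup when a loop taking `loop_fraction` of the
   software execution time (0..1) runs `loop_speedup` times faster in
   hardware. Ideal case: no communication or setup overhead. */
double overall_speedup(double loop_fraction, double loop_speedup) {
    assert(loop_fraction >= 0.0 && loop_fraction <= 1.0);
    assert(loop_speedup > 0.0);
    return 1.0 / ((1.0 - loop_fraction) + loop_fraction / loop_speedup);
}

/* Evaluate overall_speedup for each ratio in ratios[], writing to out[]. */
void sweep_ratios(double loop_fraction, const double *ratios,
                  double *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = overall_speedup(loop_fraction, ratios[i]);
}
```

For a loop that takes half the execution time, ratios of 30 and 5 give overall speedups of about 1.94 and 1.67. In this model, lowering the loop speedup from 30 to 5 costs comparatively little.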

36
Ideal Energy Savings for Different Architectures

Varied loop power ratio (FPGA power /
microprocessor power) to account for different
architectures: 2.5, 2.0, 1.5 (base case), 1.0
Energy savings quite resilient to variations
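
A simple energy model suggests why the savings vary little with the power ratio. The C sketch below is an assumption-laden illustration, not the model behind the reported results: it assumes the system draws `power_ratio` times the microprocessor's power while the loop runs on the FPGA and the microprocessor's power otherwise, and the function name and example inputs are hypothetical.

```c
#include <assert.h>

/* Fraction of energy saved relative to an all-software run.
   loop_fraction: share of software execution time in the moved loop (0..1)
   loop_speedup:  sw time / hw time of the loop itself
   power_ratio:   FPGA power / microprocessor power
   Energy is power times time, normalized to the all-software case. */
double energy_savings(double loop_fraction, double loop_speedup,
                      double power_ratio) {
    assert(loop_fraction >= 0.0 && loop_fraction <= 1.0);
    assert(loop_speedup > 0.0 && power_ratio > 0.0);
    double relative_energy = (1.0 - loop_fraction)
                           + power_ratio * loop_fraction / loop_speedup;
    return 1.0 - relative_energy;
}
```

With a loop fraction of 0.5 and a loop speedup of 10, power ratios of 1.0, 1.5, and 2.5 give savings of 0.45, 0.425, and 0.375. The hardware runs for so short a time that its higher power has a modest effect on total energy.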

37
How is Automated Partitioning Done?
Previous data obtained manually
[Figure: System partitioning -> Sw spec / Hw spec -> Sw design / Hw design; a later Partitioning step maps software onto the Processor + FPGA; Hw design targets the ASIC]
38
Source-Level Partitioning
SW source -> Compiler front-end -> Hw/Sw partitioning
Front-end converts code into an intermediate format, such as SUIF (Stanford University Intermediate Format).
Intermediate format is explored for hardware candidates.
Hw/Sw partitioning -> Compiler back-end -> Assembly and object files -> Assembler and linker -> Binary -> Processor
Binary is generated from assembling and linking.
Hw/Sw partitioning -> Hw source -> Synthesis -> Netlists -> FPGA
Hw source is generated and synthesized into netlists.
39
Problems with Source-Level Partitioning

Though technically superior, source-level
partitioning
Disrupts standard commercial tool flow
significantly
Requires special compiler (ouch!)
Multiple source languages, changing source
languages
How to deal with library code, assembly code, object
code?

[Figure: one compiler front-end per source language, e.g., a C SUIF compiler for C source; no front-end shown for Java source]
40
Binary Partitioning
SW source -> Compilation -> Assembly and object files -> Assembler and linker -> Binary
Source code is first compiled and linked in order to create a binary.
Binary -> Hw/Sw partitioning -> Hw source + updated binary
Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning.
Hw source -> Synthesis -> Netlists
HDL is generated and synthesized, and the binary is updated to use the hardware.
Updated binary -> Processor; netlists -> FPGA
41
Binary-Level Partitioning Results (ICCAD'02)

Binary-level
Average speedup: 1.4
Average energy savings: 13%
Large area overhead, averaging 10,325 gates

Source-level
Average speedup: 1.5
Average energy savings: 27%
Average area: 4,361 gates

42
Binary Partitioning Could Eventually Lead to
Dynamic Hw/Sw Partitioning

Dynamic software optimization gaining interest
e.g., HP's Dynamo
What better optimization than moving to FPGA?
Add component on-chip
Detects most frequent sw loops
Decompiles a loop
Performs compiler optimizations
Synthesizes to a netlist
Places and routes the netlist onto (simple) FPGA
Updates sw to call FPGA

Self-improving IC
Can be invisible to designer
Appears as efficient processor
HARD! Much future work.
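
The first step in the list above, detecting the most frequent software loops, could be approached by counting taken backward branches. The C sketch below is one plausible scheme and not a description of any actual on-chip component: the table size, type names, and function names are all hypothetical, and a real profiler would need a replacement policy when the table fills up.

```c
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16  /* hypothetical number of loops tracked at once */

typedef struct {
    uint32_t target;  /* loop header address (backward-branch target) */
    uint32_t count;   /* times the backward branch was taken */
    int used;         /* nonzero if this slot holds a loop */
} LoopEntry;

typedef struct {
    LoopEntry e[TABLE_SIZE];
} LoopTable;

/* Mark every slot as empty. */
void loop_table_clear(LoopTable *t) {
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        t->e[i].target = 0;
        t->e[i].count = 0;
        t->e[i].used = 0;
    }
}

/* Record one taken branch. Only backward branches (target <= pc) are
   treated as loop iterations; forward branches are ignored. */
void record_branch(LoopTable *t, uint32_t pc, uint32_t target) {
    if (target > pc) return;
    LoopEntry *free_slot = NULL;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (t->e[i].used && t->e[i].target == target) {
            t->e[i].count++;
            return;
        }
        if (!t->e[i].used && free_slot == NULL)
            free_slot = &t->e[i];
    }
    if (free_slot != NULL) {
        free_slot->used = 1;
        free_slot->target = target;
        free_slot->count = 1;
    }
    /* Table full: this sketch simply drops loops it has not seen before. */
}

/* Return the most frequently iterated loop, or NULL if none was recorded. */
const LoopEntry *hottest_loop(const LoopTable *t) {
    const LoopEntry *best = NULL;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (t->e[i].used && (best == NULL || t->e[i].count > best->count))
            best = &t->e[i];
    }
    return best;
}
```

The loop returned by `hottest_loop` would then be the candidate handed to the decompile, optimize, and synthesize steps.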

43
Conclusions

Hardware/software partitioning can significantly
improve software speed and energy
Single-chip microprocessor/FPGA platforms,
increasing in popularity, make such partitioning
even more attractive
Successful commercial tool still on the horizon
Binary-level partitioning may help in some cases
Source-level can yield massive parallelism
(Profs. Najjar/Payne)
Future dynamic hw/sw partitioning possible?
Distinction between sw/hw continually being
blurred!
Many people involved
Greg Stitt, Roman Lysecky, Shawn Nematbakhsh,
Dinesh Suresh, Walid Najjar, Jason Villarreal,
Tom Payne, several others
Support from NSF, Triscend, and soon SRC
Exciting new directions!