Title: Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs
1. Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs
- Frank Vahid
- Associate Professor
- Dept. of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems at UC Irvine
- http://www.cs.ucr.edu/vahid
- This research has been supported by the National Science Foundation, NEC, Trimedia, and Triscend
2. General Purpose vs. Special Purpose
3. General Purpose vs. Single Purpose Processors
total = 0
for i = 1 to N loop
  total += M[i]
end loop
- Designers have long known that
- General-purpose processors are flexible
- Single-purpose processors are fast
General purpose: flexibility, design cost, time-to-market
OR
Single purpose: performance, power efficiency, size
4. Mixing General and Single Purpose Processors
- A.k.a. hardware/software partitioning
- Hardware: single-purpose processors
- coprocessor, accelerator, peripheral, etc.
- Software: general-purpose processors
- Though hardware underneath!
- Especially important for embedded systems
- Computers embedded in devices (cameras, cars, toys, even people)
- Speed, cost, time-to-market, power, and size demands are tough
5. How is Partitioning Done for Embedded Systems?
- Partitioning into hw and sw blocks done early
- During conceptual stage
- Sw design done separately from hw design
- Attempts since late 1980s to automate not yet successful
- Partitioning manually is reasonably straightforward
- Spec is informal and not machine readable
- Sw algorithms may differ from hw algorithms
- No compelling need for tools
[Figure: system partitioning produces a sw spec and a hw spec; sw design targets the processor, hw design targets the ASIC]
6. New Platforms Invite New Efforts in Hw/Sw Partitioning
- New single-chip platforms contain both general-purpose processor and an FPGA
- FPGA: Field-programmable gate array
- Programmable just like software -> Flexible
- Intended largely to implement single-purpose processors
- Can we perform a later partitioning to improve the software too?
[Figure: system partitioning into sw spec and hw spec, followed by sw design and hw design, now targeting a processor/FPGA platform alongside the ASIC]
7. Commercial Single-Chip Microprocessor/FPGA Platforms
- Triscend E5: based on 8-bit 8051 CISC core (2000)
- 10 Dhrystone MIPS at 40 MHz
- Up to 40K logic gates
- Cost only about $4
8. Single-Chip Microprocessor/FPGA Platforms
- Atmel FPSLIC
- Field-Programmable System-Level IC
- Based on AVR 8-bit RISC core
- 20 Dhrystone MIPS
- 5k-40k logic gates
- $5-10
Courtesy of Atmel
9. Single-Chip Microprocessor/FPGA Platforms
- Triscend A7 chip (2001)
- Based on ARM7 32-bit RISC processor
- 54 Dhrystone MIPS at 60 MHz
- Up to 40k logic gates
- $10-20 in volume
Courtesy of Triscend
10. Single-Chip Microprocessor/FPGA Platforms
- Altera's Excalibur EPXA 10 (2002)
- ARM (922T) hard core
- 200 Dhrystone MIPS at 200 MHz
- 200k to 2 million logic gates
Source: www.altera.com
11. Single-Chip Microprocessor/FPGA Platforms
- Xilinx Virtex II Pro (2002)
- PowerPC based
- 420 Dhrystone MIPS at 300 MHz
- 1 to 4 PowerPCs
- 4 to 16 gigabit transceivers
- 12 to 216 multipliers
- Millions of logic gates
- 200k to 4M bits RAM
- 204 to 852 I/O
- $100-500 (>25,000 units)
- Up to 16 serial transceivers
- 622 Mbps to 3.125 Gbps
[Figure labels: PowerPCs, config. logic]
Courtesy of Xilinx
12. Single-Chip Microprocessor/FPGA Platforms
- Why wouldn't future microprocessor chips include some amount of on-chip FPGA?
- One argument against: area
- Lots of silicon area taken up by FPGA
- FPGA about 20-30 times less area efficient than custom logic
- FPGA used to be for prototyping, too big for final products
- But chip trends imply that FPGAs will be O.K. in final products
13. How Much is Enough?
Perhaps a bit small
14. How Much is Enough?
Reasonably sized
15. How Much is Enough?
Probably plenty big for most of us
16. How Much is Enough?
More than typically necessary
17. How Much Custom Logic is Enough?
1993: 1 million logic transistors
Perhaps a bit small
8-bit processor: 50,000 tr. Pentium: 3 million tr. MPEG decoder: several million tr.
18. How Much Custom Logic is Enough?
1996: 5-8 million logic transistors
Reasonably sized
19. How Much Custom Logic is Enough?
1999: 10-50 million logic transistors
Probably plenty big for most of us
20. How Much Custom Logic is Enough?
2002: 100-200 million logic transistors
More than typically necessary
21. How Much Custom Logic is Enough?
1993: 1 M
2008: >1 BILLION logic transistors
Perhaps very few people could design this
22. Very Few Companies Can Design High-End ICs
[Figure: design productivity gap relative to Moore's Law. Source: ITRS'99]
- Designer productivity growing at slower rate
- 1981: 100 designer months -> $1M
- 2002: 30,000 designer months -> $300M
23. Single-Chip Platforms with On-Chip FPGAs
- So, big FPGAs on-chip are O.K., because mainstream designers couldn't have used all that silicon area anyway
- But, couldn't designers use custom logic instead of FPGAs to make smaller chips and save costs?
24. Shrinking Chips
- Yes, but there's a limit
- Chips becoming pin limited
[Figure: pads connecting to external pins]
25. Trend Towards Pre-Fabricated Platforms: ASSPs
- ASSP: application specific standard product
- Domain-specific pre-fabricated IC
- e.g., digital camera IC
- ASIC: application specific IC
- ASSP revenue > ASIC
- ASSP design starts > ASIC
- Unique IC design
- Ignores quantity of same IC
- ASIC design starts decreasing
- Due to strong benefits of using pre-fabricated devices
Source: Gartner/Dataquest, September '01
26. Microprocessor/FPGA Platforms
- Trends point towards such platforms increasing in popularity
- Can we automatically partition the software to utilize the FPGA?
- For improved speed and energy
27. Automatic Hardware/Software Partitioning
- Since late 1980s, goal has been spec in, hw/sw out
- But no successful commercial tool yet. Why?
// From MediaBench's JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
  DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  DCTELEM tmp10, tmp11, tmp12, tmp13;
  DCTELEM z1, z2, z3, z4, z5, z11, z13;
  DCTELEM *dataptr;
  int ctr;
  SHIFT_TEMPS

  /* Pass 1: process rows. */
  dataptr = data;
  for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
    tmp0 = dataptr[0] + dataptr[7];
    tmp7 = dataptr[0] - dataptr[7];
    tmp1 = dataptr[1] + dataptr[6];
    ...
Thousands of lines like this in dozens of files
28. Why No Successful Tool Yet?
- Most research has focused on extensive exploration
- Roots in VLSI CAD
- Decompose problem into fine-grained operations
- Apply sophisticated partitioning algorithms
- Examples
- Min-cut, dynamic programming, simulated annealing, tabu-search, genetic evolution, etc.
- Is this overkill?
[Figure: partitioner operating on 1000s of nodes (like circuit partitioning)]
29. We Really Only Need Consider a Few Loops Due to the 90-10 Rule
- Recent appearance of embedded benchmark suites
- Enables analysis -> understanding of the real problem
- We've examined UCLA's MediaBench, NetBench, Motorola's Powerstone
- Currently examining EEMBC (embedded equivalent of SPEC)
- UCR loop analysis tools based on SimpleScalar and Simics
Assigned each loop a number, sorted by fraction
of contribution to total execution time
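The ranking step described above can be sketched in C as follows. This is a minimal illustration, not the UCR loop analysis tools: the `LoopProfile` type, the `loops_to_cover` function, and the idea of feeding it per-loop cycle counts from a profiler are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical profile record: one entry per loop found in the program. */
typedef struct {
    int    id;      /* number assigned to the loop */
    double cycles;  /* cycles spent inside the loop */
} LoopProfile;

/* qsort comparator: most expensive loop first. */
static int by_cycles_desc(const void *a, const void *b)
{
    const LoopProfile *x = a;
    const LoopProfile *y = b;
    if (x->cycles < y->cycles) return 1;
    if (x->cycles > y->cycles) return -1;
    return 0;
}

/*
 * Sorts loops by contribution to execution time (in place) and returns how
 * many of the top loops are needed to cover `target` (e.g. 0.9) of
 * total_cycles. Returns 0 if all loops together fall short of the target.
 */
size_t loops_to_cover(LoopProfile *loops, size_t n,
                      double total_cycles, double target)
{
    assert(total_cycles > 0.0);
    qsort(loops, n, sizeof loops[0], by_cycles_desc);

    double covered = 0.0;
    for (size_t i = 0; i < n; i++) {
        covered += loops[i].cycles;
        if (covered / total_cycles >= target)
            return i + 1;
    }
    return 0;
}
```

Calling `loops_to_cover(profile, n, total, 0.9)` on a benchmark's profile would report how many loops account for 90% of execution time, which is the quantity the 90-10 rule is about.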
30. The 90-10 Rule Holds for Embedded Systems
In fact, the most frequent loop alone took 50% of time, using 1% of code
31. So Need We Only Consider the First Few Loops? Not Necessarily
- What if programs were self-similar w.r.t. 90-10 rule?
- Remove most frequent loop: does 90-10 rule still hold?
- Intuition might say yes: remove loop, and we have another program.
- So we need only speed up the first few loops
- After that, speedups are limited
- Good from tool perspective!
32. Manually Partitioned Several PowerStone Benchmarks onto Triscend A7 and E5 Chips
- Used multimeter and timer to measure performance and power
- Obtained good speedups and energy savings by partitioning software among microprocessor and on-chip FPGA
[Figures: E5 IC; Triscend A7 development board]
33. Simulation-Based Results for More Benchmarks
(Quicker than physical implementation, results matched reasonably well)
34. Looking at Multiple Loops per Benchmark
- Manually created several partitioned versions of each benchmark
- Most speedup gained with first 20,000 gates
- Surprisingly few gates!
- Stitt, Grattan and Vahid, Field-programmable Custom Computing Machines (FCCM) 2002
- Stitt and Vahid, IEEE Design and Test, Dec. 2002
- J. Villarreal, D. Suresh, G. Stitt, F. Vahid and W. Najjar, Design Automation of Embedded Systems, 2002 (to appear).
35. Ideal Speedups for Different Architectures
- Varied loop speedup ratio (sw time / hw time of loop itself) to see impact of faster microprocessor or slower FPGA: 30, 20, 10 (base case), 5 and 2
- Loop speedups of 5 or more work fine for first few loops, not hard to achieve
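The relation between loop speedup ratio and whole-program speedup can be sketched with the usual Amdahl-style formula. This is an idealized model assumed for illustration (no communication or reconfiguration overhead); the function name `ideal_speedup` and its parameters are not from the slides.

```c
#include <assert.h>

/*
 * Idealized overall speedup when loops taking `loop_fraction` of the
 * software execution time are moved to the FPGA and run `loop_speedup`
 * times faster there (sw time / hw time of the loop itself).
 * Assumption: the rest of the program is unchanged and there is no
 * overhead for invoking the hardware.
 */
double ideal_speedup(double loop_fraction, double loop_speedup)
{
    assert(loop_fraction >= 0.0 && loop_fraction <= 1.0);
    assert(loop_speedup > 0.0);

    /* New execution time, normalized to an original time of 1.0. */
    double new_time = (1.0 - loop_fraction) + loop_fraction / loop_speedup;
    return 1.0 / new_time;
}
```

For a loop taking 50% of execution time, this model gives roughly 1.33 at a loop speedup of 2, 1.67 at 5, 1.82 at 10, and 1.94 at 30, so under these assumptions most of the benefit is already reached at modest loop speedups.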
36. Ideal Energy Savings for Different Architectures
- Varied loop power ratio (FPGA power / microprocessor power) to account for different architectures: 2.5, 2.0, 1.5 (base case), 1.0
- Energy savings quite resilient to variations
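One way to see why energy savings can be insensitive to the power ratio is a simple energy = power x time model. The sketch below is an assumed simplification, not the measurement method used for the reported results: `ideal_energy_savings` and its `idle_ratio` parameter (microprocessor power while waiting on the FPGA, as a fraction of its active power) are hypothetical.

```c
#include <assert.h>

/*
 * Idealized fractional energy savings from moving loops to the FPGA.
 * All quantities are normalized to the software-only run, where the
 * microprocessor draws power 1.0 for time 1.0 (energy 1.0).
 *
 *   loop_fraction: fraction of software time spent in the moved loops
 *   loop_speedup:  sw time / hw time of the loops themselves
 *   power_ratio:   FPGA power / microprocessor power
 *   idle_ratio:    assumed microprocessor power while the FPGA runs,
 *                  relative to its active power (0.0 = fully gated)
 */
double ideal_energy_savings(double loop_fraction, double loop_speedup,
                            double power_ratio, double idle_ratio)
{
    assert(loop_fraction >= 0.0 && loop_fraction <= 1.0);
    assert(loop_speedup > 0.0);
    assert(power_ratio >= 0.0 && idle_ratio >= 0.0);

    double sw_energy = 1.0 - loop_fraction;           /* code left in software */
    double hw_time   = loop_fraction / loop_speedup;  /* loops on the FPGA */
    double hw_energy = (power_ratio + idle_ratio) * hw_time;

    return 1.0 - (sw_energy + hw_energy);
}
```

With a 50% loop, a loop speedup of 10, and `idle_ratio` of 0, this model yields savings of about 45% at a power ratio of 1.0, 42.5% at 1.5, and 37.5% at 2.5. The hardware runs for so little time that its higher power matters little, which is consistent with the resilience noted above.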
37. How is Automated Partitioning Done?
Previous data obtained manually
[Figure: earlier design flow (system partitioning into sw spec and hw spec, sw design and hw design, targeting processor/FPGA and ASIC) with an added partitioning step]
38. Source-Level Partitioning
Front-end converts code into intermediate format, such as SUIF (Stanford University Intermediate Format)
Intermediate format explored for hardware candidates
Binary is generated from assembling and linking. Hw source is generated and synthesized into netlist
[Figure: SW source -> compiler front-end -> hw/sw partitioning -> compiler back-end (assembly and object files) and hw source -> assembler and linker / synthesis -> binary / netlists -> processor / FPGA]
39. Problems with Source-Level Partitioning
- Though technically superior, source-level partitioning
- Disrupts standard commercial tool flow significantly
- Requires special compiler (ouch!)
- Multiple source languages, changing source languages
- How to deal with library code, assembly code, object code?
[Figure: several source languages (C, Java) feeding the compiler front-end through C SUIF compilers, with some paths marked "?"]
40. Binary Partitioning
Source code is first compiled and linked in order to create a binary.
Candidate hardware regions (a few small, frequent loops) are decompiled for partitioning
HDL is generated and synthesized, and binary is updated to use hardware
[Figure: SW source -> compilation -> assembly and object files -> assembler and linker -> binary -> hw/sw partitioning -> updated binary and hw source -> synthesis -> netlists; updated binary runs on processor, netlists on FPGA]
41. Binary-Level Partitioning Results (ICCAD'02)
- Binary-Level
- Average speedup, 1.4
- Average energy savings, 13%
- Large area overhead, averaging 10,325 gates
- Source-Level
- Average speedup, 1.5
- Average energy savings, 27%
- Average 4,361 gates
42. Binary Partitioning Could Eventually Lead to Dynamic Hw/Sw Partitioning
- Dynamic software optimization gaining interest
- e.g., HP's Dynamo
- What better optimization than moving to FPGA?
- Add component on-chip
- Detects most frequent sw loops
- Decompiles a loop
- Performs compiler optimizations
- Synthesizes to a netlist
- Places and routes the netlist onto (simple) FPGA
- Updates sw to call FPGA
- Self-improving IC
- Can be invisible to designer
- Appears as efficient processor
- HARD! Much future work.
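The first step of such an on-chip component, detecting the most frequent software loops, might be approximated by counting taken backward branches per target address. The sketch below is a hypothetical software model of that idea only; the table size, the `LoopTable` type, and the function names are assumptions, and the slides do not specify how the detection would actually be built.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LOOP_TABLE_SIZE 16  /* hypothetical capacity of the on-chip table */

/* Small table of loop candidates, keyed by backward-branch target. */
typedef struct {
    uint32_t target[LOOP_TABLE_SIZE];
    uint32_t count[LOOP_TABLE_SIZE];
    int      used[LOOP_TABLE_SIZE];
} LoopTable;

void loop_table_init(LoopTable *t)
{
    assert(t != NULL);
    memset(t, 0, sizeof *t);
}

/*
 * Record one taken branch. Only backward branches (target <= pc) are
 * treated as loop back-edges; forward branches are ignored. New targets
 * are dropped when the table is full (no replacement policy modeled).
 */
void loop_table_record(LoopTable *t, uint32_t pc, uint32_t target)
{
    assert(t != NULL);
    if (target > pc)
        return;

    int free_slot = -1;
    for (int i = 0; i < LOOP_TABLE_SIZE; i++) {
        if (t->used[i] && t->target[i] == target) {
            if (t->count[i] != UINT32_MAX)  /* saturating counter */
                t->count[i]++;
            return;
        }
        if (!t->used[i] && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {
        t->used[free_slot] = 1;
        t->target[free_slot] = target;
        t->count[free_slot] = 1;
    }
}

/* Stores the hottest loop's start address; returns 0 if the table is empty. */
int loop_table_hottest(const LoopTable *t, uint32_t *target_out)
{
    assert(t != NULL && target_out != NULL);
    int best = -1;
    for (int i = 0; i < LOOP_TABLE_SIZE; i++) {
        if (t->used[i] && (best < 0 || t->count[i] > t->count[best]))
            best = i;
    }
    if (best < 0)
        return 0;
    *target_out = t->target[best];
    return 1;
}
```

The address returned by `loop_table_hottest` would be the starting point for the later steps in the list (decompilation, optimization, synthesis), none of which are modeled here.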
43. Conclusions
- Hardware/software partitioning can significantly improve software speed and energy
- Single-chip microprocessor/FPGA platforms, increasing in popularity, make such partitioning even more attractive
- Successful commercial tool still on the horizon
- Binary-level partitioning may help in some cases
- Source-level can yield massive parallelism (Profs. Najjar/Payne)
- Future dynamic hw/sw partitioning possible?
- Distinction between sw/hw continually being blurred!
- Many people involved
- Greg Stitt, Roman Lysecky, Shawn Nematbakhsh, Dinesh Suresh, Walid Najjar, Jason Villarreal, Tom Payne, several others
- Support from NSF, Triscend, and soon SRC
- Exciting new directions!