Improving Embedded System Software Speed and Energy using Microprocessor/FPGA Platform ICs

Slides: 44
Provided by: frank126
Learn more at: http://www.cs.ucr.edu
1
Improving Embedded System Software Speed and
Energy using Microprocessor/FPGA Platform ICs
  • Frank Vahid
  • Associate Professor
  • Dept. of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems at UC Irvine
  • http://www.cs.ucr.edu/vahid
  • This research has been supported by the National
    Science Foundation, NEC, Trimedia, and Triscend

2
General Purpose vs. Special Purpose
  • Standard tradeoff

3
General Purpose vs. Single Purpose Processors
total = 0
for i = 1 to N loop
  total += M[i]
end loop
  • Designers have long known that
  • General-purpose processors are flexible
  • Single-purpose processors are fast

General purpose: flexibility, design cost, time-to-market
OR
Single purpose: performance, power efficiency, size
4
Mixing General and Single Purpose Processors
  • A.k.a. Hardware/software partitioning
  • Hardware: single-purpose processors
  • Coprocessor, accelerator, peripheral, etc.
  • Software: general-purpose processors
  • Though hardware underneath!
  • Especially important for embedded systems
  • Computers embedded in devices (cameras, cars,
    toys, even people)
  • Speed, cost, time-to-market, power, and size
    demands are tough

5
How is Partitioning Done for Embedded Systems?
  • Partitioning into hw and sw blocks done early
  • During conceptual stage
  • Sw design done separately from hw design
  • Attempts since late 1980s to automate not yet
    successful
  • Partitioning manually is reasonably
    straightforward
  • Spec is informal and not machine readable
  • Sw algorithms may differ from hw algorithms
  • No compelling need for tools

[Figure: system partitioning yields a sw spec and a hw spec; sw design targets the processor and hw design targets the ASIC]
6
New Platforms Invite New Efforts in Hw/Sw
Partitioning
  • New single-chip platforms contain both
    general-purpose processor and an FPGA
  • FPGA: field-programmable gate array
  • Programmable just like software -> flexible
  • Intended largely to implement single-purpose
    processors
  • Can we perform a later partitioning to improve
    the software too?

[Figure: the same partitioning flow (sw spec, hw spec, sw design, hw design), now targeting a single processor/FPGA chip alongside the ASIC]
7
Commercial Single-Chip Microprocessor/FPGA
Platforms
  • Triscend E5 based on 8-bit 8051 CISC core (2000)
  • 10 Dhrystone MIPS at 40MHz
  • up to 40K logic gates
  • Cost only about $4

8
Single-Chip Microprocessor/FPGA Platforms
  • Atmel FPSLIC
  • Field-Programmable System-Level IC
  • Based on AVR 8-bit RISC core
  • 20 Dhrystone MIPS
  • 5k-40k logic gates
  • $5-$10

Courtesy of Atmel
9
Single-Chip Microprocessor/FPGA Platforms
  • Triscend A7 chip (2001)
  • Based on ARM7 32-bit RISC processor
  • 54 Dhrystone MIPS at 60 MHz
  • Up to 40k logic gates
  • $10-$20 in volume

Courtesy of Triscend
10
Single-Chip Microprocessor/FPGA Platforms
  • Altera's Excalibur EPXA10 (2002)
  • ARM (922T) hard core
  • 200 Dhrystone MIPS at 200 MHz
  • 200k to 2 million logic gates

Source: www.altera.com
11
Single-Chip Microprocessor/FPGA Platforms
  • Xilinx Virtex II Pro (2002)
  • PowerPC based
  • 420 Dhrystone MIPS at 300 MHz
  • 1 to 4 PowerPCs
  • 4 to 16 gigabit transceivers
  • 12 to 216 multipliers
  • Millions of logic gates
  • 200k to 4M bits RAM
  • 204 to 852 I/O
  • $100-$500 (>25,000 units)
  • Up to 16 serial transceivers
  • 622 Mbps to 3.125 Gbps

PowerPCs
Config. logic
Courtesy of Xilinx
12
Single-Chip Microprocessor/FPGA Platforms
  • Why wouldn't future microprocessor chips include
    some amount of on-chip FPGA?
  • One argument against: area
  • Lots of silicon area taken up by FPGA
  • FPGA about 20-30 times less area efficient than
    custom logic
  • FPGA used to be for prototyping, too big for
    final products
  • But chip trends imply that FPGAs will be O.K. in
    final products

13
How Much is Enough?
Perhaps a bit small
14
How Much is Enough?
Reasonably sized
15
How Much is Enough?
Probably plenty big for most of us
16
How Much is Enough?
More than typically necessary
17
How Much Custom Logic is Enough?
1993: 1 million logic transistors
Perhaps a bit small
8-bit processor: 50,000 tr.; Pentium: 3 million
tr.; MPEG decoder: several million tr.
18
How Much Custom Logic is Enough?
1996: 5-8 million logic transistors
Reasonably sized
19
How Much Custom Logic is Enough?
1999: 10-50 million logic transistors
Probably plenty big for most of us
20
How Much Custom Logic is Enough?
2002: 100-200 million logic transistors
More than typically necessary
21
How Much Custom Logic is Enough?
1993: 1 M
2008: >1 BILLION logic transistors
Perhaps very few people could design this
22
Very Few Companies Can Design High-End ICs
Design productivity gap
Moore's Law
Source: ITRS'99
  • Designer productivity growing at slower rate
  • 1981: 100 designer months -> $1M
  • 2002: 30,000 designer months -> $300M

23
Single-Chip Platforms with On-Chip FPGAs
  • So, big FPGAs on-chip are O.K., because
    mainstream designers couldn't have used all that
    silicon area anyway
  • But couldn't designers use custom logic instead
    of FPGAs to make smaller chips and save costs?

24
Shrinking Chips
  • Yes, but there's a limit
  • Chips becoming pin limited

Pads connecting to external pins
25
Trend Towards Pre-Fabricated Platforms: ASSPs
  • ASSP: application-specific standard product
  • Domain-specific pre-fabricated IC
  • e.g., digital camera IC
  • ASIC: application-specific IC
  • ASSP revenue > ASIC revenue
  • ASSP design starts > ASIC design starts
  • Design start: a unique IC design
  • Ignores quantity of same IC
  • ASIC design starts decreasing
  • Due to strong benefits of using pre-fabricated
    devices

Source: Gartner/Dataquest, September '01
26
Microprocessor/FPGA Platforms
  • Trends point towards such platforms increasing in
    popularity
  • Can we automatically partition the software to
    utilize the FPGA?
  • For improved speed and energy

27
Automatic Hardware/Software Partitioning
  • Since late 1980s goal has been spec in, hw/sw
    out
  • But no successful commercial tool yet. Why?

// From MediaBench's JPEG codec
GLOBAL(void)
jpeg_fdct_ifast (DCTELEM * data)
{
  DCTELEM tmp0, tmp1, tmp2, tmp3, tmp4, tmp5, tmp6, tmp7;
  DCTELEM tmp10, tmp11, tmp12, tmp13;
  DCTELEM z1, z2, z3, z4, z5, z11, z13;
  DCTELEM *dataptr;
  int ctr;
  SHIFT_TEMPS

  /* Pass 1: process rows. */
  dataptr = data;
  for (ctr = DCTSIZE-1; ctr >= 0; ctr--) {
    tmp0 = dataptr[0] + dataptr[7];
    tmp7 = dataptr[0] - dataptr[7];
    tmp1 = dataptr[1] + dataptr[6];
    // Thousands of lines like this in dozens of files
28
Why No Successful Tool Yet?
  • Most research has focused on extensive
    exploration
  • Roots in VLSI CAD
  • Decompose problem into fine-grained operations
  • Apply sophisticated partitioning algorithms
  • Examples
  • Min-cut, dynamic programming, simulated
    annealing, tabu-search, genetic evolution, etc.
  • Is this overkill?

1000s of nodes (like circuit partitioning)
Partitioner
29
We Really Only Need Consider a Few Loops Due to
the 90-10 Rule
  • Recent appearance of embedded benchmark suites
  • Enables analysis -> understanding of the real
    problem
  • We've examined UCLA's MediaBench, NetBench,
    Motorola's PowerStone
  • Currently examining EEMBC (embedded equivalent of
    SPEC)
  • UCR loop analysis tools based on SimpleScalar and
    Simics

[Code excerpt: the same jpeg_fdct_ifast loop from MediaBench's JPEG codec shown on slide 27]
Assigned each loop a number, sorted by fraction
of contribution to total execution time
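
The ranking step described here can be sketched in a few lines of C. This is only an illustration, not the UCR loop analysis tools themselves: the `LoopProfile` record, the function names, and the use of cycle counts as the profiling metric are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical profile record: one entry per loop in the program. */
typedef struct {
    int id;                 /* loop number, assigned after sorting */
    unsigned long cycles;   /* cycles spent inside this loop */
} LoopProfile;

/* qsort comparator: descending by cycle count. */
static int by_cycles_desc(const void *a, const void *b)
{
    const LoopProfile *x = a, *y = b;
    if (x->cycles < y->cycles) return 1;
    if (x->cycles > y->cycles) return -1;
    return 0;
}

/* Sort loops by contribution to execution time and number them 1..n. */
void rank_loops(LoopProfile *loops, size_t n)
{
    qsort(loops, n, sizeof loops[0], by_cycles_desc);
    for (size_t i = 0; i < n; i++)
        loops[i].id = (int)(i + 1);
}

/* Fraction of total execution time covered by the k most frequent
   loops. total_cycles includes time spent outside any loop. */
double coverage_of_top(const LoopProfile *loops, size_t n, size_t k,
                       unsigned long total_cycles)
{
    assert(total_cycles > 0);
    unsigned long covered = 0;
    for (size_t i = 0; i < k && i < n; i++)
        covered += loops[i].cycles;
    return (double)covered / (double)total_cycles;
}
```

Calling `rank_loops` and then `coverage_of_top` for k = 1, 2, 3, ... gives the cumulative curve that a 90-10 analysis plots.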

30
The 90-10 Rule Holds for Embedded Systems
In fact, the most frequent loop alone took 50% of
time, using 1% of code
31
So Need We Only Consider the First Few Loops? Not
Necessarily
  • What if programs were self-similar w.r.t. the 90-10
    rule?
  • Remove the most frequent loop: does the 90-10 rule
    still hold?
  • Intuition might say yes: remove a loop, and we
    have another program.
  • So we need only speed up the first few loops
  • After that, speedups are limited
  • Good from tool perspective!
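
The limit on later speedups can be read off Amdahl's law. The notation below is assumed for this sketch and does not come from the slides: f_i is the fraction of original execution time spent in loop i, s_i is the speedup of loop i once moved to hardware, and S_k is the overall speedup after moving the k most frequent loops.

```latex
% f_i: fraction of original execution time in loop i (assumed notation)
% s_i: speedup of loop i when moved to hardware (assumed notation)
S_k \;=\; \frac{1}{\left(1 - \sum_{i=1}^{k} f_i\right) + \sum_{i=1}^{k} \frac{f_i}{s_i}}
\;\le\; \frac{1}{1 - \sum_{i=1}^{k} f_i}
```

For example, if the most frequent loop accounts for half of execution time, accelerating it alone can at most double overall speed, however fast the hardware is. Each further loop adds a smaller f_i, so the bound grows more slowly with every loop added.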

32
Manually Partitioned Several PowerStone
Benchmarks onto Triscend A7 and E5 Chips
E5 IC
  • Used multimeter and timer to measure performance
    and power
  • Obtained good speedups and energy savings by
    partitioning software among microprocessor and
    on-chip FPGA

Triscend A7 development board
33
Simulation-Based Results for More Benchmarks
(Quicker than physical implementation, results
matched reasonably well)
34
Looking at Multiple Loops per Benchmark
  • Manually created several partitioned versions of
    each benchmark
  • Most speedup gained with first 20,000 gates
  • Surprisingly few gates!
  • Stitt, Grattan and Vahid, Field-programmable
    Custom Computing Machines (FCCM) 2002
  • Stitt and Vahid, IEEE Design and Test, Dec. 2002
  • J. Villarreal, D. Suresh, G. Stitt, F. Vahid and
    W. Najjar, Design Automation of Embedded Systems,
    2002 (to appear).

35
Ideal Speedups for Different Architectures
  • Varied loop speedup ratio (sw time / hw time of
    loop itself) to see impact of faster
    microprocessor or slower FPGA: 30, 20, 10 (base
    case), 5 and 2
  • Loop speedups of 5 or more work fine for first
    few loops, not hard to achieve
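
A minimal sketch of how such an ideal speedup could be computed, assuming (for illustration only) that every moved loop gets the same loop speedup ratio and that communication overhead is ignored. The function name and parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Ideal overall speedup when the first k loops move to hardware.
   frac[i]: fraction of original execution time spent in loop i.
   ratio:   loop speedup ratio (sw time / hw time of the loop itself),
            assumed equal for every moved loop in this sketch. */
double ideal_speedup(const double *frac, size_t k, double ratio)
{
    assert(ratio > 0.0);
    double moved = 0.0;
    for (size_t i = 0; i < k; i++)
        moved += frac[i];
    assert(moved >= 0.0 && moved <= 1.0);
    /* New time = time left in software + hardware time of moved loops. */
    return 1.0 / ((1.0 - moved) + moved / ratio);
}
```

Sweeping `ratio` over 30, 20, 10, 5 and 2 for a fixed set of loop fractions reproduces the kind of sensitivity study the slide describes.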

36
Ideal Energy Savings for Different Architectures
  • Varied loop power ratio (FPGA power /
    microprocessor power) to account for different
    architectures: 2.5, 2.0, 1.5 (base case), 1.0
  • Energy savings quite resilient to variations
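
One way to see why the savings are resilient is a simple energy = power x time model. The sketch below assumes that the microprocessor draws no power while the FPGA runs, which is a simplification made for this example and not a claim from the slides; the function name and parameters are likewise hypothetical.

```c
#include <assert.h>

/* Ideal fraction of energy saved by moving loops to the FPGA.
   moved:       fraction of original execution time in the moved loops
   speedup:     loop speedup ratio (sw time / hw time of the loop itself)
   power_ratio: loop power ratio (FPGA power / microprocessor power)
   Assumes the microprocessor draws no power while the FPGA runs. */
double ideal_energy_savings(double moved, double speedup, double power_ratio)
{
    assert(moved >= 0.0 && moved <= 1.0);
    assert(speedup > 0.0 && power_ratio >= 0.0);
    /* Energies are normalized to the all-software energy of 1.0. */
    double sw_energy = 1.0 - moved;
    double hw_energy = power_ratio * (moved / speedup);
    return 1.0 - (sw_energy + hw_energy);
}
```

With half the time in the moved loops and a loop speedup of 10, this model gives savings of 0.45, 0.425 and 0.375 for power ratios of 1.0, 1.5 and 2.5: the hardware runs for so short a time that even a large power ratio changes the total only slightly.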

37
How is Automated Partitioning Done?
Previous data obtained manually
[Figure: the system partitioning flow (sw spec, hw spec, sw design, hw design) with an added partitioning step after sw design, targeting the processor/FPGA platform alongside the ASIC]
38
Source-Level Partitioning
Flow: SW source -> compiler front-end -> hw/sw partitioning ->
compiler back-end -> assembly and object files -> assembler and
linker -> binary -> processor
Hw/sw partitioning also emits hw source -> synthesis -> netlists -> FPGA
  • Front-end converts code into an intermediate format,
    such as SUIF (Stanford University Intermediate Format)
  • Intermediate format is explored for hardware candidates
  • Binary is generated by assembling and linking
  • Hw source is generated and synthesized into netlists
39
Problems with Source-Level Partitioning
  • Though technically superior, source-level
    partitioning:
  • Disrupts standard commercial tool flow
    significantly
  • Requires special compiler (ouch!)
  • Multiple source languages, changing source
    languages
  • How to deal with library code, assembly code, object
    code?

[Figure: source files in several languages (C, Java) feeding the compiler front-end; C sources have a C SUIF compiler, while the Java path is marked "?"]
40
Binary Partitioning
Flow: SW source -> compilation -> assembly and object files ->
assembler and linker -> binary -> hw/sw partitioning ->
updated binary -> processor
Hw/sw partitioning also emits hw source -> synthesis -> netlists -> FPGA
  • Source code is first compiled and linked to create a binary
  • Candidate hardware regions (a few small, frequent loops)
    are decompiled for partitioning
  • HDL is generated and synthesized, and the binary is
    updated to use the hardware
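
The partitioning step in this flow has to decide which candidate loops go to the FPGA. The sketch below uses a plain greedy heuristic under a gate budget; it is an illustrative stand-in, not the algorithm of the tool described in the talk, and the `Candidate` fields and function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate region: a small, frequent loop recovered
   from the binary. Fields are illustrative, not from any real tool. */
typedef struct {
    double time_fraction;  /* share of execution time, from profiling */
    unsigned gates;        /* estimated FPGA gates if built in hardware */
    int in_hardware;       /* set to 1 when the loop moves to the FPGA */
} Candidate;

/* Greedy pass: candidates must already be sorted by descending
   time_fraction. Each loop that still fits in the remaining gate
   budget is moved to hardware. Returns the total gates used. */
unsigned partition_greedy(Candidate *c, size_t n, unsigned gate_budget)
{
    unsigned used = 0;
    for (size_t i = 0; i < n; i++) {
        c[i].in_hardware = 0;
        /* used never exceeds gate_budget, so the subtraction is safe. */
        if (c[i].gates <= gate_budget - used) {
            c[i].in_hardware = 1;
            used += c[i].gates;
        }
    }
    return used;
}
```

Because only a few small loops are considered, a simple pass like this is cheap, in contrast to the fine-grained exploration algorithms listed on slide 28.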
41
Binary-Level Partitioning Results (ICCAD'02)
  • Binary-Level
  • Average speedup: 1.4
  • Average energy savings: 13%
  • Large area overhead, averaging 10,325 gates
  • Source-Level
  • Average speedup: 1.5
  • Average energy savings: 27%
  • Average 4,361 gates

42
Binary Partitioning Could Eventually Lead to
Dynamic Hw/Sw Partitioning
  • Dynamic software optimization gaining interest
  • e.g., HP's Dynamo
  • What better optimization than moving to FPGA?
  • Add component on-chip
  • Detects most frequent sw loops
  • Decompiles a loop
  • Performs compiler optimizations
  • Synthesizes to a netlist
  • Places and routes the netlist onto (simple) FPGA
  • Updates sw to call FPGA
  • Self-improving IC
  • Can be invisible to designer
  • Appears as efficient processor
  • HARD! Much future work.
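
The first step of that list, detecting the most frequent software loops, might be approximated by counting taken backward branches. The sketch below is one assumed way to do it in a small fixed-size table; the table size, type names, and replacement policy are hypothetical and not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 16  /* small table, as an on-chip profiler might use */

typedef struct {
    uint32_t target[TABLE_SIZE];  /* loop-start address (branch target) */
    uint32_t count[TABLE_SIZE];   /* times the backward branch was taken */
    size_t used;
} LoopTable;

/* Record one taken branch. A backward branch (target <= pc) is treated
   as a loop iteration; forward branches are ignored. */
void observe_branch(LoopTable *t, uint32_t pc, uint32_t target)
{
    if (target > pc)
        return;
    for (size_t i = 0; i < t->used; i++) {
        if (t->target[i] == target) {
            t->count[i]++;
            return;
        }
    }
    if (t->used < TABLE_SIZE) {   /* new loops are dropped when full */
        t->target[t->used] = target;
        t->count[t->used] = 1;
        t->used++;
    }
}

/* Return the start address of the most frequent loop, or 0 if none seen. */
uint32_t hottest_loop(const LoopTable *t)
{
    uint32_t best = 0, best_count = 0;
    for (size_t i = 0; i < t->used; i++) {
        if (t->count[i] > best_count) {
            best_count = t->count[i];
            best = t->target[i];
        }
    }
    return best;
}
```

A real on-chip profiler would need a better policy than dropping new loops when the table fills, but the counting idea stays the same.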

43
Conclusions
  • Hardware/software partitioning can significantly
    improve software speed and energy
  • Single-chip microprocessor/FPGA platforms,
    increasing in popularity, make such partitioning
    even more attractive
  • Successful commercial tool still on the horizon
  • Binary-level partitioning may help in some cases
  • Source-level can yield massive parallelism
    (Profs. Najjar/Payne)
  • Future dynamic hw/sw partitioning possible?
  • Distinction between sw/hw continually being
    blurred!
  • Many people involved
  • Greg Stitt, Roman Lysecky, Shawn Nematbakhsh,
    Dinesh Suresh, Walid Najjar, Jason Villarreal,
    Tom Payne, several others
  • Support from NSF, Triscend, and soon SRC
  • Exciting new directions!