Portability for FPGA Applications: Warp Processing and SystemC Bytecode

Transcript and Presenter's Notes



1
Portability for FPGA Applications: Warp Processing
and SystemC Bytecode
  • Contributing Ph.D. Students
  • Roman Lysecky (Ph.D. 2005, now Asst. Prof. at
    Univ. of Arizona)
  • Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ.
    of Florida, Gainesville)
  • Scotty Sirowy (current)
  • David Sheldon (current)
  • Chen Huang (current)

Frank Vahid
Dept. of CSE, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
This research was supported in part by the
National Science Foundation, the Semiconductor
Research Corporation, Intel, Freescale, IBM, and
Xilinx
2
Portable Applications on PCs
One binary
x86 binary
How? Why?
Pentium
Opteron
Atom
Dual Core
Multiple platforms
3
Portable Applications on PCs
  • Standard software binary
  • Dynamic software binary translation

x86 Binary
VLIW
x86 µP
VLIW Binary
SW binary translation
4
Meanwhile, Circuits on FPGAs Show Large Speedups
  • Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS,
    MICRO, CASES, DAC, DATE, ICCAD, RAW,

5
FPGAs Entering Computing Mainstream
  • AMD Opteron
  • Intel QuickAssist
  • Cray, SGI
  • Mitrionics
  • IBM Cell (research)
  • Xilinx, Altera

SGI Altix supercomputer (UCR: 64 Itaniums plus 2
FPGA RASCs)
6
Circuits on FPGAs are Software Binaries
Microprocessor Binaries (Instructions): bits loaded into program memory
FPGA Binaries (Circuits), aka "bitstream", not hardware: bits loaded into LUTs and SMs
[Figure: example bit patterns 0111 and 0010]
7
Portable Applications FPGAs
  • Standard software binary
  • Dynamic translation

x86 Binary
VLIW
x86 µP
VLIW Binary
SW binary translation
8
Warp Processing
Step 1: Initially, software binary loaded into
instruction memory
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
9
Warp Processing
Step 2: Microprocessor executes instructions in software
binary
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
10
Warp Processing
Step 3: Profiler monitors instructions and detects
critical regions in binary
[Figure: warp processor with Profiler and µP highlighted; the µP executes a repeating stream of beq and add instructions]
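As a rough illustration of step 3, a profiler can find critical regions by counting taken backward branches, since each one marks a loop iteration. The sketch below assumes that approach; the table size, struct names, and function names are hypothetical and are not taken from the slides.

```c
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16   /* hypothetical table size */

typedef struct {
    uint32_t target;   /* address a backward branch jumped to (loop head) */
    uint32_t count;    /* how many times that branch was taken */
} ProfileEntry;

typedef struct {
    ProfileEntry slots[PROFILE_SLOTS];
    size_t used;
} Profiler;

/* Record one executed branch; only taken backward branches are counted. */
void profiler_observe(Profiler *p, uint32_t pc, uint32_t target, int taken)
{
    if (!taken || target >= pc) return;          /* not a loop back-edge */
    for (size_t i = 0; i < p->used; i++) {
        if (p->slots[i].target == target) { p->slots[i].count++; return; }
    }
    if (p->used < PROFILE_SLOTS) {               /* table full: new loops are dropped */
        p->slots[p->used].target = target;
        p->slots[p->used].count = 1;
        p->used++;
    }
}

/* Return the loop head with the highest count, or 0 if nothing was seen. */
uint32_t profiler_hottest(const Profiler *p)
{
    uint32_t best = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->slots[i].count > best_count) {
            best_count = p->slots[i].count;
            best = p->slots[i].target;
        }
    }
    return best;
}
```

A zero-initialized Profiler is fed every branch the µP executes; profiler_hottest then names the loop that the on-chip CAD would read in next.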
11
Warp Processing
Step 4: On-chip CAD reads in critical region
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
12
Warp Processing
Step 5: On-chip CAD decompiles critical region into
control data flow graph (CDFG)
Recover loops, arrays, subroutines, etc. needed
to synthesize good circuits
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
13
Warp Processing
Step 6: On-chip CAD synthesizes decompiled CDFG to a
custom (parallel) circuit
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
14
Warp Processing
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]


15
Warp Processing
>10x speedups for some apps
Step 8: On-chip CAD replaces instructions in binary to
use hardware, causing performance and energy to
warp by an order of magnitude or more
Updated binary:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  // instructions that interact with FPGA
  Ret reg4
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]


16
Warp Processing Challenges
  • Can we decompile binaries sufficiently for
    synthesis?
  • Can we just-in-time (JIT) compile to FPGAs?

[Figure: on-chip CAD flow. Binary -> profiling and partitioning -> decompilation -> CDFG -> JIT FPGA compilation -> FPGA binary; a binary updater produces the updated microprocessor binary. Components: Profiler, µP, I, D, FPGA, On-chip CAD]
17
Decompilation
  • Recover high-level information from binary:
    branches, loops, arrays, subroutines, ...
  • Adapted previous methods for processor-processor
    translation (UQBT)
  • Developed new synthesis-oriented methods (e.g.,
    reroll loops, strength promotion)

Original C Code
  long f( short a[10] ) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }
Corresponding Assembly
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
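To make "reroll loops" concrete, here is a minimal sketch of the idea, assuming the compiler had unrolled the summation loop by two. The function names are hypothetical. Rerolling recovers the simple loop so that a synthesis tool can pick its own unrolling or pipelining instead of inheriting the compiler's choice.

```c
/* Unrolled form, as an optimizing compiler might emit it (assumed here). */
long sum_unrolled(const short a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i += 2) {
        accum += a[i];
        accum += a[i + 1];
    }
    return accum;
}

/* Rerolled form: one addition per iteration, the structure a
 * decompiler would aim to recover before synthesis. */
long sum_rerolled(const short a[10])
{
    long accum = 0;
    for (int i = 0; i < 10; i++)
        accum += a[i];
    return accum;
}
```

Both functions compute the same sum; only the loop structure handed to synthesis differs.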
18
Decompilation Results vs. C
  • Synthesis from decompiled binary is competitive
    with synthesis from C

19
Decompilation Results on Optimized H.264: In-depth
Study with Freescale
  • Again, competitive with synthesis from C

20
Decompilation Effective Even with Compiler
Optimizations
  • Do compiler optimizations hurt decompilation?
  • (Surprisingly) found optimized code synthesizes
    to even better circuits

Speedup when decompiled binary is partitioned and
synthesized to FPGA
Average Speedup of 10 Examples
21
Decompilation
Summary: Decompilation is surprisingly effective
at recovering high-level program structures for
synthesis
Stitt et al.: ICCAD02, DAC03,
CODES/ISSS05, ICCAD05, FPGA05, TODAES06,
TODAES07
Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now
Asst. Prof. at UF Gainesville)
22
Warp Processing Challenges
  • Can we decompile binaries sufficiently for
    synthesis?
  • Can we just-in-time (JIT) compile to FPGAs?

[Figure: on-chip CAD flow. Binary -> profiling and partitioning -> decompilation -> CDFG -> JIT FPGA compilation -> FPGA binary; a binary updater produces the updated microprocessor binary. Components: Profiler, µP, I, D, FPGA, On-chip CAD]
23
Challenge: JIT Compile to FPGA
  • Developed ultra-lean CAD heuristics for
    synthesis, placement, routing, and technology
    mapping, e.g.,
  • Logic synthesis: run single expand phase
  • Technology mapping: bottom-up graph clustering
    heuristic
  • Placement: place critical path first, then
    adjacent items
  • Routing: use resource graph that matches switch
    matrix / channel structure

Commercial tool (logic synthesis, tech. map., placement, routing): 60 MB, 9.1 s
Ultra-lean Riverside JIT FPGA tools (drawn to
scale): 0.2 s
Penalty: 1.3-2x in performance and size (even more
might be acceptable)
24
JIT Compile to FPGA
Summary: Ultra-lean JIT FPGA compiler -> 40x
speedup, 20x less memory, 1.3x-2x circuit
penalty
Lysecky et al.: DAC03, ISSS/CODES03,
DATE04, DAC04, DATE05, FCCM05, TODAES06
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now
Asst. Prof. at Univ. of Arizona)
25
Warp Processing Results: Performance Speedup (Most
Frequent Kernel Only)
vs. 200 MHz ARM
1 ARM-only execution
Overall application speedup average is 7.4
26
Warping Thread-Based Applications
for (i = 0; i < 10; i++) thread_create( f, i );
Multi-core platforms -> multi-threaded apps
OS schedules threads onto available µPs
Remaining threads added to queue
OS invokes on-chip CAD tools to create
accelerators for f()
Thread warping: use one core to create
accelerator for waiting threads
OS schedules threads onto accelerators (possibly
dozens), in addition to µPs
Very large speedups possible: parallelism at
bit, arithmetic, and now thread level too
[Figure: Compiler -> Binary containing f(); OS running on several µPs plus an FPGA; performance chart]
27
Memory Access Synchronization (MAS)
  • Must deal with widely known memory bottleneck
    problem
  • FPGAs great, but often can't get data to them
    fast enough

for (i = 0; i < 10; i++) thread_create(
thread_function, a, i );
RAM
DMA
Data for dozens of threads can create bottleneck
void f( int a, int val ) int result
for (i 0 i lt 10 i) result ai
val . . . .
FPGA
.
Same array
  • Threaded programs exhibit unique feature:
    Multiple threads often access same or overlapping
    data
  • Solution: Fetch data once, broadcast to multiple
    threads (MAS)

28
Memory Access Synchronization (MAS)
  • Detect overlapping memory regions ("windows")
  • Synthesis creates active smart buffer
    (Guo/Najjar FPGA04)
  • Actively fetches data, stores the reused data,
    delivers windows to threads
  • Active rather than passive component, designed
    for specific threads


for (i = 0; i < 100; i++) thread_create(
thread_function, a, i );
void f( int a[], int i ) { int result;
result = a[i] + a[i+1] + a[i+2] + a[i+3]; . . . . }
[Figure: array elements a[0] through a[5]; data A[0-103] streamed from RAM via DMA to the smart buffer, which delivers windows such as A[0-3], A[1-4], and A[6-9] to the threads]
Each thread accesses different addresses, but
addresses may overlap
Buffer delivers window to each thread
Without smart buffer: 400 memory accesses. With smart
buffer: 104 memory accesses
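The access counts can be checked with a small model. This sketch assumes thread i reads a[i] through a[i + window - 1] and that the smart buffer fetches each distinct address exactly once; the function names and the MAX_ADDR bound are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Accesses if every thread fetches its own window from memory. */
size_t accesses_without_buffer(size_t threads, size_t window)
{
    return threads * window;
}

/* Accesses if each distinct address is fetched once and then broadcast.
 * Thread i is assumed to read a[i] .. a[i + window - 1]. */
size_t accesses_with_buffer(size_t threads, size_t window)
{
    enum { MAX_ADDR = 4096 };            /* hypothetical upper bound */
    bool fetched[MAX_ADDR] = { false };
    size_t fetches = 0;
    for (size_t i = 0; i < threads; i++) {
        for (size_t k = 0; k < window; k++) {
            size_t addr = i + k;
            if (addr < MAX_ADDR && !fetched[addr]) {
                fetched[addr] = true;    /* first touch costs one fetch */
                fetches++;
            }
        }
    }
    return fetches;
}
```

For 100 threads and a window of 4, the model gives 400 accesses without the buffer and 103 with it. The slide's figure of 104 corresponds to a buffer holding a[0] through a[103].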
29
Speedups from Thread Warping
  • Chose benchmarks with extensive parallelism
  • Four core (ARM11 400 MHz) base system
  • Virtex IV FPGA at circuit-specific clock
    frequency (100-300 MHz)
  • Average 130x speedup

But, FPGA uses additional area. Our FPGA size
36 ARM11s
  • Still 20x faster than 32-core system (and 11x
    faster than 64-core)
  • Simulation pessimistic, actual results likely
    better
  • FPGA more flexible

30
Warp Scenarios
Warping takes time (seconds, minutes, or more):
when useful?
  • Long-running applications
  • Scientific computing, etc.
  • Recurring applications (save and reuse FPGA
    configurations)
  • Common in embedded systems
  • Might view as (long) boot phase
  • For networked/docked devices, CAD can occur on
    server (ongoing work)

Long Running Applications
Recurring Applications
µP (1st execution)
On-chip CAD
µP
Time
Time
31
Why Dynamic?
  • Static good, but hiding FPGA opens technique to
    all sw platforms
  • Standard languages/tools/binaries

Static Compiling to FPGAs: Specialized Language -> Specialized Compiler -> Binary and Netlist -> µP and FPGA
Dynamic Compiling to FPGAs: Any Language -> Any Compiler -> Binary -> µP and FPGA
32
Synthesis-Friendly Applications
  • Coding style impacts synthesis results

33
Synthesis-Friendly Application Coding Guidelines
Coding Guidelines
34
Conversion to Explicit Control Flow (CECF)
  • Problem: Function pointers may prevent static
    control flow analysis
  • Guideline: Don't use function pointers. Replace
    with if-else, static calls
  • Makes possible targets explicit

Before:
  void f( int (*fp) (int) ) {
    . . . . .
    for (i=0; i < 10; i++)
      a[i] = fp(i);
  }
After:
  enum Target { FUNC1, FUNC2, FUNC3 };
  void f( enum Target fp ) {
    . . . . .
    for (i=0; i < 10; i++)
      if (fp == FUNC1) a[i] = f1(i);
      else if (fp == FUNC2) a[i] = f2(i);
      else a[i] = f3(i);
  }
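A self-contained version of the CECF guideline is sketched below so the two styles can be compared directly. The bodies of f1, f2, and f3 are hypothetical stand-ins, and the function names fill_with_pointer and fill_explicit are introduced only for this illustration.

```c
enum Target { FUNC1, FUNC2, FUNC3 };

/* Hypothetical stand-ins for the possible call targets. */
static int f1(int x) { return x + 1; }
static int f2(int x) { return x * 2; }
static int f3(int x) { return x * x; }

/* Before: the callee is only known at run time, which can block
 * static control flow analysis. */
void fill_with_pointer(int a[10], int (*fp)(int))
{
    for (int i = 0; i < 10; i++)
        a[i] = fp(i);
}

/* After CECF: every possible target appears as an explicit static call. */
void fill_explicit(int a[10], enum Target fp)
{
    for (int i = 0; i < 10; i++) {
        if (fp == FUNC1)      a[i] = f1(i);
        else if (fp == FUNC2) a[i] = f2(i);
        else                  a[i] = f3(i);
    }
}
```

Calling fill_with_pointer(a, f2) and fill_explicit(b, FUNC2) fills both arrays identically, but only the second form exposes all three candidate circuits to a synthesis tool.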
35
Speedups from Synthesis-Friendly Coding
Guidelines
  • 10 guidelines
  • For a 1,000-line benchmark, 5-6 changes are typical,
    tens of minutes each

36
Speedups from Synthesis-Friendly Coding Guidelines
  • Original C code (Powerstone, Mediabench)
  • Original average speedup with FPGA: 2.6x
    (excludes brev)
  • Refined C code with guidelines
  • Average speedup: 8.4x (excludes brev)
  • Guidelines led to 3.5x improvement of speedup

37
Spatial Algorithms for FPGAs
  • As FPGAs become more common, app writers may expect
    FPGA presence
  • Example: Count patterns
  • Sequential algorithm
  • Hash table
  • 10s of cycles per pattern
  • Spatial algorithm (for FPGA)
  • Pipelined stages

[Figure: pipeline of Levels 1 through m; the current pattern enters Level 1, and each level contains count, pattern, and logic blocks]
Spatial algorithm: Essence is the connectivity of
components, not the sequencing of instructions
38
Spatial Algorithms for FPGAs
  • Spatial algorithm 2
  • Pipelined binary tree

[Figure: pipelined binary tree; the current pattern enters Level 1 and moves down one level per stage]
Level 1: logic, memory holding 1 pattern, 1 count
Level 2: logic, memory holding 2 patterns, 2 counts
Level 3: logic, memory holding 4 patterns, 4 counts
. . .
Level n: logic, memory holding 2^n patterns, 2^n counts
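The pipelined binary tree can be modeled in software to show how a new pattern is accepted every clock tick. This is a minimal sketch under stated assumptions: the known patterns are stored as a binary search tree in heap order, each tree level is one pipeline stage, and all type and function names are hypothetical.

```c
#include <stddef.h>

#define LEVELS 3                         /* tree depth */
#define NODES  ((1 << LEVELS) - 1)       /* stored patterns */

typedef struct {
    int valid;      /* is a pattern currently in this stage? */
    int key;        /* the pattern being looked up */
    size_t node;    /* tree node this stage compares against */
} Stage;

typedef struct {
    int pattern[NODES];    /* BST in heap order: children of n are 2n+1, 2n+2 */
    int count[NODES];      /* matches seen per stored pattern */
    Stage stage[LEVELS];   /* stage[k] works on tree level k */
} Pipeline;

/* One clock tick: every occupied stage does one compare, then its
 * pattern either retires (match) or moves to the next level. */
void pipeline_tick(Pipeline *p, int has_input, int key)
{
    /* Walk from the last stage to the first so each pattern advances
     * exactly one level per tick. */
    for (int k = LEVELS - 1; k >= 0; k--) {
        Stage *s = &p->stage[k];
        if (!s->valid) continue;
        int stored = p->pattern[s->node];
        if (s->key == stored) {
            p->count[s->node]++;                 /* hit: count and retire */
        } else if (k + 1 < LEVELS) {
            Stage *next = &p->stage[k + 1];
            next->valid = 1;
            next->key = s->key;
            next->node = 2 * s->node + (s->key < stored ? 1 : 2);
        }                                        /* miss at a leaf: dropped */
        s->valid = 0;
    }
    if (has_input) {                             /* new pattern enters level 1 */
        p->stage[0].valid = 1;
        p->stage[0].key = key;
        p->stage[0].node = 0;
    }
}
```

After the last input, LEVELS extra ticks with has_input set to 0 drain the pipeline. Throughput is one pattern per tick regardless of tree depth, which is the point of the spatial formulation.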
39
Example
48
73
Possible patterns pre-stored in binary search
tree circuit
Stage 1
Stage 2
Stage 3
Stage 4
40
Example
23
48
Stage 1
73
Stage 2
Stage 3
Stage 4
41
Example
75
23
Stage 1
48
Stage 2
73
Stage 3
Stage 4
42
Example
11
75
Stage 1
23
Stage 2
48
Stage 3
73
Stage 4
1
43
Example
11
Stage 1
75
Stage 2
1
23
Stage 3
48
Stage 4
1
1
44
Study of Spatial Algorithms in FCCM
Year  Application               Type
2001  3D Vec. Normalization     Spatial
2001  Efficient CAM             --
2001  Automated Sensor          Temporal
2001  Regular Expression        Spatial
2002  Hyperspectral Image       Spatial
2002  Machine Vision            Spatial
2002  RC4                       Temporal
2002  Set Covering              Spatial
2002  Template Matching         Spatial
2002  Triangle Mesh             Spatial
2003  Congruential Sieves       Temporal
2003  Content Scanning          Temporal
2003  F.P and Square Root       Spatial
2003  Gaussian Noise            Spatial
2003  TRNG                      --
2004  3D FDTD Method            Spatial
2004  Deep Packet Filter        --
2004  Online Floating Point     --
2004  Molecular Dynamics        Spatial
2004  Pattern Matching          Spatial
2004  Seismic Migration         Spatial
2004  Software Deceleration     --
2004  V.M Window                --
2005  Data Mining               Spatial
2005  Cell Automata             Temporal
2005  Particle Graphics         Spatial
2005  Radiosity                 Temporal
2005  Transient Waves           Spatial
2005  Road Traffic              Temporal
2006  All Pairs Shortest Path   Spatial
2006  Apriori Data Mining       Spatial
2006  Molecular Dynamics        Spatial
2006  Gaussian Elimination      Spatial
2006  Radiation Dose            Temporal
2006  Random Variates           Spatial
  • FCCM 2001-2006
  • 70 papers describing fast application on FPGA
  • Examined 35 in depth (every other one)
  • 6 used device-specific features
  • 9 represented expected synthesized circuit from
    the obvious sequential algorithm
  • 20 were spatially-oriented applications
  • e.g., earlier pipelined binary tree

45
Portable Spatial Applications?
  • Current portable microprocessor binaries:
    sequential
  • Extensions for threads, processes, ...
  • How to support spatial constructs?
  • Ports, connections, timing model
  • .....

SystemC (www.systemc.org)
  • Adds libraries and macros, still standard C++
  • Sequential and spatial constructs
  • Compiling links in the simulation kernel
  • Self-executing simulation
  • Intended for SoC simulation
46
(No Transcript)
47
Bytecode
  • Modern portability approach
  • Java, C#

Compiler
Virtual Machine (VM): Program that executes
bytecode. May JIT compile to native architecture
bytecode
VM
VM
VM
Pentium
Opteron
Atom
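To illustrate "program that executes bytecode", here is a minimal stack-machine interpreter. The four opcodes are hypothetical and are not Java, C#, or SystemC bytecode; the sketch only shows that the same bytecode array runs unchanged wherever vm_run can be compiled.

```c
#include <stddef.h>

/* Hypothetical opcodes for illustration only. */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Interpret a bytecode program and return the value left on top of
 * the stack, or 0 for an empty stack or a malformed program. */
int vm_run(const int *code, size_t len)
{
    int stack[64];
    size_t sp = 0;                        /* number of values on the stack */
    for (size_t pc = 0; pc < len; pc++) {
        switch (code[pc]) {
        case OP_PUSH:                     /* next word is the literal operand */
            if (pc + 1 >= len || sp >= 64) return 0;
            stack[sp++] = code[++pc];
            break;
        case OP_ADD:
        case OP_MUL:
            if (sp < 2) return 0;         /* not enough operands */
            if (code[pc] == OP_ADD) stack[sp - 2] += stack[sp - 1];
            else                    stack[sp - 2] *= stack[sp - 1];
            sp--;
            break;
        case OP_HALT:
        default:
            return sp ? stack[sp - 1] : 0;
        }
    }
    return sp ? stack[sp - 1] : 0;
}
```

A program such as { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT } evaluates (2 + 3) * 4. A JIT-compiling VM would translate hot bytecode sequences to native code instead of interpreting them one opcode at a time.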
48
SystemC Bytecode?
SystemC
Compiler
SystemC bytecode
VM
VM
VM
Opteron FPGA
Pentium
FPGA
49
SystemC Bytecode Compiler
class EDGE_DETECTOR : public sc_module {
  // signal declarations
  EDGE_DETECTOR() {
    SC_method(mainComp); sensitive << dataReady;
    SC_method(getPixel); sensitive << clock.pos();
  }
  void getPixel() {
    . . .
    dataReady.write(1);
  }
  void mainComp() {
    int i, j;
    for (i = 0; i < 3; i++)
      for (j = 0; j < 3; j++)
        sumX = sumX + mem.read() * GX[i][j];
    . . .
    edge.write(sumX + sumY);
  }
};
[Figure: SystemC source -> SystemC Bytecode Compiler -> SystemC bytecode. The compiler consists of a Pinapa Front End (AST, Link, ELAB) and a Bytecode Back End (Code Generation, Register Allocation)]
50
SystemC Bytecode Emulator
SystemC bytecode
Bytecode uploadable via USB drive
Warping also possible: JIT compile bytecode
portions to circuits on FPGA
FPGA
Accelerators speed up emulation
51
Dynamic Enables Expandable Logic Concept
RAM
Expandable Logic
Expandable RAM
uP
Performance
52
Summary
  • FPGAs entering mainstream
  • Portability of applications is important
  • Dynamic binary translation to FPGAs: Warp
    processing
  • Shown feasible. Extensive future work
  • Trends towards FPGA ubiquity
  • Microprocessor binaries need extensions for
    spatial constructs
  • One approach: SystemC bytecode and virtual
    machine
  • Can also be warped for circuit-speed

http://www.cs.ucr.edu/~vahid/pubs