Portability for FPGA Applications: Warp Processing and SystemC Bytecode

Slides: 53

Provided by: vah

Learn more at: http://www.cs.ucr.edu

Transcript and Presenter's Notes

Title: Portability for FPGA Applications: Warp Processing and SystemC Bytecode

1
Portability for FPGA Applications: Warp Processing and SystemC Bytecode

Contributing Ph.D. Students
Roman Lysecky (Ph.D. 2005, now Asst. Prof. at Univ. of Arizona)
Greg Stitt (Ph.D. 2007, now Asst. Prof. at Univ. of Florida, Gainesville)
Scotty Sirowy (current)
David Sheldon (current)
Chen Huang (current)

Frank Vahid
Dept. of CSE, University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
This research was supported in part by the National Science Foundation, the Semiconductor Research Corporation, Intel, Freescale, IBM, and Xilinx
2
Portable Applications on PCs
One binary (x86 binary), multiple platforms: Pentium, Opteron, Atom, Dual Core
How? Why?
3
Portable Applications on PCs

Standard software binary
Dynamic software binary translation

[Diagram: SW binary translation from an x86 binary (x86 µP) to a VLIW binary (VLIW)]
4
Meanwhile, Circuits on FPGAs Show Large Speedups

Int. Symp. on FPGAs, FCCM, FPL, CODES/ISSS, ICS,
MICRO, CASES, DAC, DATE, ICCAD, RAW,

5
FPGAs Entering Computing Mainstream

AMD Opteron
Intel QuickAssist
Cray, SGI
Mitrionics
IBM Cell (research)
Xilinx, Altera

SGI Altix supercomputer (UCR: 64 Itaniums plus 2 FPGA RASCs)
6
Circuits on FPGAs are Software Binaries
Microprocessor Binaries (Instructions): bits loaded into program memory
FPGA Binaries (Circuits), aka "bitstream": bits loaded into LUTs and SMs; not hardware
[Diagram: FPGA with example bits 0111, 0010]
7
Portable Applications FPGAs

Standard software binary
Dynamic translation

[Diagram: SW binary translation from an x86 binary (x86 µP) to a VLIW binary (VLIW)]
8
Warp Processing
Step 1: Initially, software binary loaded into instruction memory
[Diagram: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
9
Warp Processing
Step 2: Microprocessor executes instructions in software binary
10
Warp Processing
Step 3: Profiler monitors instructions and detects critical regions in binary
[Diagram: repeated add and beq instructions in the executing binary]
11
Warp Processing
Step 4: On-chip CAD reads in critical region
12
Warp Processing
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
Recover loops, arrays, subroutines, etc. needed to synthesize good circuits
[Diagram: On-chip CAD shown as Dynamic Part. Module (DPM)]
13
Warp Processing
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
14
Warp Processing
Step 7: On-chip CAD maps circuit onto FPGA

15
Warp Processing
>10x speedups for some apps
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
  Mov reg3, 0
  Mov reg4, 0
  loop:
  // instructions that interact with FPGA
  Ret reg4

16
Warp Processing Challenges

Can we decompile binaries sufficiently for synthesis?
Can we just-in-time (JIT) compile to FPGAs?

[Diagram: on-chip CAD flow with profiling and partitioning, decompilation (CDFG), JIT FPGA compilation (FPGA binary), and binary updater (microprocessor binary), alongside the Profiler, µP, I, D, and FPGA]
17
Decompilation

Recover high-level information from binary: branches, loops, arrays, subroutines, etc.
Adapted previous methods for processor-processor translation (UQBT)
Developed new synthesis-oriented methods (e.g., reroll loops, strength promotion)

Original C Code
  long f( short a[10] ) {
    long accum = 0;
    for (int i = 0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding Assembly
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
18
Decompilation Results vs. C

Synthesis from decompiled binary is competitive
with synthesis from C

19
Decompilation Results on Optimized H.264: In-depth Study with Freescale

Again, competitive with synthesis from C

20
Decompilation Effective Even with Compiler
Optimizations

Do compiler optimizations hurt decompilation?
(Surprisingly) found optimized code synthesizes
to even better circuits

Speedup when decompiled binary is partitioned and
synthesized to FPGA
Average Speedup of 10 Examples
21
Decompilation
Summary: Decompilation is surprisingly effective at recovering high-level program structures for synthesis
Stitt et al., ICCAD02, DAC03, CODES/ISSS05, ICCAD05, FPGA05, TODAES06, TODAES07
Ph.D. work of Greg Stitt (Ph.D. UCR 2007, now Asst. Prof. at UF Gainesville)
22
Warp Processing Challenges

Can we decompile binaries sufficiently for synthesis?
Can we just-in-time (JIT) compile to FPGAs?

[Diagram: on-chip CAD flow with profiling and partitioning, decompilation (CDFG), JIT FPGA compilation (FPGA binary), and binary updater (microprocessor binary), alongside the Profiler, µP, I, D, and FPGA]
23
Challenge: JIT Compile to FPGA
Commercial tool (logic synthesis, tech. map., placement, routing): 60 MB, 9.1 s

Developed ultra-lean CAD heuristics for synthesis, placement, routing, and technology mapping, e.g.,
Logic synthesis: run single expand phase
Technology mapping: bottom-up graph clustering heuristic
Placement: place critical path first, then adjacent items
Routing: use resource graph that matches switch matrix / channel structure

Ultra-lean Riverside JIT FPGA tools (drawn to scale): 0.2 s
Penalty: 1.3-2x in performance and size (even more might be acceptable)
24
JIT Compile to FPGA
Summary: Ultra-lean JIT FPGA compiler → 40x speedup, 20x less memory, 1.3x-2x circuit penalty
Lysecky et al., DAC03, ISSS/CODES03, DATE04, DAC04, DATE05, FCCM05, TODAES06
Ph.D. work of Roman Lysecky (Ph.D. UCR 2005, now Asst. Prof. at Univ. of Arizona)
25
Warp Processing Results: Performance Speedup (Most Frequent Kernel Only)
vs. 200 MHz ARM
1 ARM-only execution
Overall application speedup average is 7.4
26
Warping Thread-Based Applications
  for (i = 0; i < 10; i++)
    thread_create( f, i );

Multi-core platforms → multi-threaded apps
OS schedules threads onto available µPs
Remaining threads added to queue
OS invokes on-chip CAD tools to create accelerators for f()
Thread warping: use one core to create accelerator for waiting threads
OS schedules threads onto accelerators (possibly dozens), in addition to µPs
Very large speedups possible: parallelism at bit, arithmetic, and now thread level too
[Diagram: compiler produces binary; OS distributes f() threads across µPs and FPGA; performance chart]
27
Memory Access Synchronization (MAS)

Must deal with widely known memory bottleneck problem
FPGAs great, but often can't get data to them fast enough

  for (i = 0; i < 10; i++)
    thread_create( thread_function, a, i );

  void f( int a[], int val ) {
    int result;
    for (i = 0; i < 10; i++)
      result += a[i] * val;
    . . . .
  }

[Diagram: RAM and DMA feeding threads on the FPGA; all threads read the same array]
Data for dozens of threads can create bottleneck

Threaded programs exhibit unique feature: multiple threads often access same or overlapping data
Solution: fetch data once, broadcast to multiple threads (MAS)

28
Memory Access Synchronization (MAS)

Detect overlapping memory regions ("windows")

Synthesis creates active "smart buffer" [Guo/Najjar FPGA04]
Actively fetches data, stores the reused data, delivers windows to threads
Active rather than passive component designed for specific threads

  for (i = 0; i < 100; i++)
    thread_create( thread_function, a, i );

  void f( int a[], int i ) {
    int result;
    result += a[i]+a[i+1]+a[i+2]+a[i+3];
    . . . .
  }

[Diagram: a[0] through a[5] in RAM; data a[0-103] streamed via DMA to the smart buffer, which delivers windows a[0-3], a[1-4], a[6-9] to threads]
Each thread accesses different addresses, but addresses may overlap
Buffer delivers window to each thread
Without smart buffer: 400 memory accesses. With smart buffer: 104 memory accesses
29
Speedups from Thread Warping

Chose benchmarks with extensive parallelism
Four core (ARM11 400 MHz) base system
Virtex IV FPGA at circuit-specific clock
frequency (100-300 MHz)
Average 130x speedup

But FPGA uses additional area: our FPGA size is equivalent to 36 ARM11s

Still 20x faster than 32-core system (and 11x
faster than 64-core)
Simulation pessimistic, actual results likely
better
FPGA more flexible

30
Warp Scenarios
Warping takes time (seconds, minutes, or more): when is it useful?

Long-running applications
Scientific computing, etc.

Recurring applications (save and reuse FPGA configurations)
Common in embedded systems
Might view as (long) boot phase
For networked/docked devices, CAD can occur on server (ongoing work)

[Diagram: timelines for long-running and recurring applications, showing µP (1st execution), on-chip CAD, and µP over time]
31
Why Dynamic?

Static good, but hiding FPGA opens technique to all sw platforms
Standard languages/tools/binaries

Static compiling to FPGAs: specialized language, specialized compiler; produces binary and netlist
Dynamic compiling to FPGAs: any language, any compiler; produces binary
[Diagram: targets are FPGA and µP]
32
Synthesis-Friendly Applications

Coding style impacts synthesis results

33
Synthesis-Friendly Application Coding Guidelines
Coding Guidelines
34
Conversion to Explicit Control Flow (CECF)

Problem: Function pointers may prevent static control flow analysis
Guideline: Don't use function pointers. Replace with if-else, static calls
Makes possible targets explicit

  void f( int (*fp) (int) ) {
    . . . . .
    for (i=0; i < 10; i++) {
      a[i] = fp(i);
    }
  }

  enum Target { FUNC1, FUNC2, FUNC3 };
  void f( enum Target fp ) {
    . . . . .
    for (i=0; i < 10; i++) {
      if (fp == FUNC1)
        a[i] = f1(i);
      else if (fp == FUNC2)
        a[i] = f2(i);
      else
        a[i] = f3(i);
    }
  }
35
Speedups from Synthesis-Friendly Coding
Guidelines

10 guidelines
For a 1,000-line benchmark, 5-6 changes typical, tens of minutes each

36
Speedups from Synthesis-Friendly Coding Guidelines

Original C code (Powerstone, Mediabench)
Original average speedup with FPGA: 2.6x (excludes brev)
Refined C code with guidelines
Average speedup: 8.4x (excludes brev)
Guidelines led to 3.5x improvement of speedup

37
Spatial Algorithms for FPGAs

As FPGAs become more common, app writers may expect FPGA presence
Example: count patterns
Sequential algorithm
  Hash table
  10s of cycles per pattern
Spatial algorithm (for FPGA)
  Pipelined stages

[Diagram: current pattern flows through pipelined Levels 1, 2, 3, 4, ..., m, each with its own count, pattern, and logic]
Spatial algorithm: essence is the connectivity of components, not the sequencing of instructions
38
Spatial Algorithms for FPGAs
Spatial algorithm 2
Pipelined binary tree

[Diagram: current pattern flows through pipelined levels, each with its own memory and logic]
Level 1: memory holds 1 pattern, 1 count
Level 2: memory holds 2 patterns, 2 counts
Level 3: memory holds 4 patterns, 4 counts
. . .
Level n: memory holds 2^n patterns, 2^n counts
. . .
39
Example
48
73
Possible patterns pre-stored in binary search
tree circuit
Stage 1
Stage 2
Stage 3
Stage 4
40
Example
23
48
Stage 1
73
Stage 2
Stage 3
Stage 4
41
Example
75
23
Stage 1
48
Stage 2
73
Stage 3
Stage 4
42
Example
11
75
Stage 1
23
Stage 2
48
Stage 3
73
Stage 4
1
43
Example
11
Stage 1
75
Stage 2
1
23
Stage 3
48
Stage 4
1
1
44
Study of Spatial Algorithms in FCCM
Year  Application              Type
2001  3D Vec. Normalization    Spatial
2001  Efficient CAM            --
2001  Automated Sensor         Temporal
2001  Regular Expression       Spatial
2002  Hyperspectral Image      Spatial
2002  Machine Vision           Spatial
2002  RC4                      Temporal
2002  Set Covering             Spatial
2002  Template Matching        Spatial
2002  Triangle Mesh            Spatial
2003  Congruential Sieves      Temporal
2003  Content Scanning         Temporal
2003  F.P and Square Root      Spatial
2003  Gaussian Noise           Spatial
2003  TRNG                     --
2004  3D FDTD Method           Spatial
2004  Deep Packet Filter       --
2004  Online Floating Point    --
2004  Molecular Dynamics       Spatial
2004  Pattern Matching         Spatial
2004  Seismic Migration        Spatial
2004  Software Deceleration    --
2004  V.M Window               --
2005  Data Mining              Spatial
2005  Cell Automata            Temporal
2005  Particle Graphics        Spatial
2005  Radiosity                Temporal
2005  Transient Waves          Spatial
2005  Road Traffic             Temporal
2006  All Pairs Shortest Path  Spatial
2006  Apriori Data Mining      Spatial
2006  Molecular Dynamics       Spatial
2006  Gaussian Elimination     Spatial
2006  Radiation Dose           Temporal
2006  Random Variates          Spatial

FCCM 2001-2006
70 papers describing fast application on FPGA
Examined 35 in depth (every other one)
6 used device-specific features
9 represented expected synthesized circuit from
the obvious sequential algorithm
20 were spatially-oriented applications
e.g., earlier pipelined binary tree

45
Portable Spatial Applications?

Current portable microprocessor binaries: sequential
Extensions for threads, processes, ...
How to support spatial constructs?
Ports, connections, timing model
.....

SystemC
Adds libraries and macros, still standard C++
Sequential and spatial constructs
Compiling links in the simulation kernel
Self-executing simulation
Intended for SoC simulation

www.systemc.org
47
Bytecode

Modern portability approach
Java, C#

Compiler produces bytecode
Virtual Machine (VM): program that executes bytecode; may JIT compile to native architecture
[Diagram: the same bytecode runs on a VM on each of Pentium, Opteron, Atom]
48
SystemC Bytecode?
[Diagram: SystemC goes through a compiler to SystemC bytecode, which runs on a VM on each platform: Opteron with FPGA, Pentium, FPGA]
49
SystemC Bytecode Compiler
SystemC
  class EDGE_DETECTOR : public sc_module {
    //signal declarations
    EDGE_DETECTOR() {
      SC_METHOD(mainComp); sensitive << dataReady;
      SC_METHOD(getPixel); sensitive << clock.pos();
    }
    void getPixel() {
      dataReady.write(1);
    }
    void mainComp() {
      int i, j;
      for (i = 0; i < 3; i++)
        for (j = 0; j < 3; j++)
          sumX = sumX + mem.read() * GX[i][j];
      edge.write(sumX + sumY);
    }
  };

[Diagram: SystemC Bytecode Compiler. Pinapa front end (AST, ELAB, link) feeds the bytecode back end (code generation, register allocation), which emits SystemC bytecode]
50
SystemC Bytecode Emulator
SystemC bytecode
Bytecode uploadable via USB drive
Warping also possible: JIT compile bytecode portions to circuits on FPGA
Accelerators speed up emulation
51
Dynamic Enables Expandable Logic Concept
RAM
Expandable Logic
Expandable RAM
uP
Performance
52
Summary