Title: Warp Processing -- Dynamic Transparent Conversion of Binaries to Circuits
Slide 1: Warp Processing -- Dynamic, Transparent Conversion of Binaries to Circuits
- Frank Vahid
- Professor, Department of Computer Science and Engineering, University of California, Riverside
- Associate Director, Center for Embedded Computer Systems, UC Irvine
- Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, Motorola/Freescale
- Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd-yr PhD), Ryan Mannion (2nd-yr PhD), Scott Sirowy (1st-yr PhD)
Slide 2: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 3: FPGAs
- FPGA -- Field-Programmable Gate Array
  - Off-the-shelf chip, evolved in the early 1990s
  - Implements a custom circuit just by downloading a stream of bits (software)
- Basic idea: a memory with 2^N addresses can implement any N-input combinational logic function
  - (Note: no gate array inside)
  - The memory is called a Lookup Table, or LUT
- FPGA fabric
  - Thousands of small (3-input) LUTs; larger LUTs are inefficient
  - Thousands of switch matrices (SM) for programming interconnections
  - Possibly additional hard-core components, like multipliers, RAM, etc.
  - CAD tools automatically map a desired circuit onto the FPGA fabric
Slide 4: FPGAs "Programmable" like Microprocessors -- Download Bits
Slide 5: FPGAs as Coprocessors
- Coprocessor -- accelerates an application kernel by implementing it as a circuit
- ASIC coprocessors are known to speed up many application kernels
  - Energy advantages too (e.g., Henkel '98, Rabaey '98, Stitt/Vahid '04)
- FPGA coprocessors also give speedup/energy benefits (Stitt/Vahid IEEE D&T '02, IEEE TECS '04)
  - Con: more silicon (~20x), ~4x performance overhead (Rose FPGA'06)
  - Pro: platform fully programmable
    - Shorter time-to-market, smaller non-recurring engineering (NRE) cost, low-cost devices available, late changes (even in-product)

[Diagrams: application running on processor + ASIC, and on processor + FPGA]
Slide 6: FPGAs as Coprocessors -- Surprisingly Competitive with ASICs
- FPGA: 34% energy savings, versus the ASIC's 48% (Stitt/Vahid IEEE D&T '02, IEEE TECS '04)
- A jet isn't as fast as a rocket, but it sure beats driving
Slide 7: FPGA -- Why (Sometimes) Better than a Microprocessor

C code for bit reversal:

  x = (x >> 16)              | (x << 16);
  x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
  x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
  x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
  x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

In general, because of concurrency, from the bit level to the task level
Slide 8: FPGAs -- Why (Sometimes) Better than a Microprocessor

Hardware for an FIR filter, vs. C code for an FIR filter:

  for (i=0; i < 128; i++)
    y[i] += c[i] * x[i];
  . . .

- On a microprocessor: 1000s of instructions, several thousand cycles
Slide 9: FPGAs are Hard to Program
- Synthesis from hardware description languages (HDLs)
  - VHDL, Verilog
  - Great for parallelism
  - But non-standard languages, manual partitioning
  - SystemC a good step
- C/C++ partitioning compilers
  - Use a language subset
  - Growing in importance
  - But a special compiler limits adoption
- The flow includes synthesis, technology mapping, place & route
- ~100 software writers for every CAD user: only about 15,000 CAD seats worldwide, versus millions of compiler seats

[Diagram: processor alongside FPGA]
Slide 10: Binary-Level Partitioning Helps
- Binary-level partitioning
  - Stitt/Vahid, ICCAD'02
  - Recent commercial product: CriticalBlue (www.criticalblue.com)
  - Partition and synthesize starting from the SW binary
- Advantages
  - Any compiler, any language, multiple sources, assembly/object support, legacy code support
  - Better incorporation into the toolflow: a less disruptive, back-end tool
- Disadvantage
  - Quality loss due to lack of high-level language constructs? (More later)

[Toolflow diagram: traditional partitioning is done before compilation; the binary-level step comes after, and includes synthesis, technology mapping, place & route]
Slide 11: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 12: Warp Processing
- Observation: dynamic binary recompilation to a different microprocessor architecture is a mature commercial technology
  - e.g., modern Pentiums translate x86 to VLIW
- Question: if we can statically recompile binaries to FPGA circuits, can we dynamically recompile binaries to FPGA circuits?
Slide 13: Warp Processing Idea
1. Initially, the software binary is loaded into instruction memory

[Architecture diagram, repeated on the following slides: profiler, instruction memory (I Mem), microprocessor (µP), data cache (D$), FPGA, on-chip CAD]
Slide 14: Warp Processing Idea
2. The microprocessor executes the instructions in the software binary
Slide 15: Warp Processing Idea
3. The profiler monitors instructions and detects critical regions in the binary (shown in the diagram as a hot loop of beq/add instructions)
Slide 16: Warp Processing Idea
4. On-chip CAD reads in the critical region
Slide 17: Warp Processing Idea
5. On-chip CAD decompiles the critical region into a control/data flow graph (CDFG)
Slide 18: Warp Processing Idea
6. On-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit
Slide 19: Warp Processing Idea
7. On-chip CAD maps the circuit onto the FPGA
Slide 20: Warp Processing Idea
8. On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more

Updated binary:

  Mov reg3, 0
  Mov reg4, 0
  loop: // instructions that interact with FPGA
  Ret reg4
Slide 21: Warp Processing Idea
- Likely multiple microprocessors per chip, serviced by one on-chip CAD block

[Diagram: several µPs sharing one profiler, FPGA, and on-chip CAD block]
Slide 22: Warp Processing -- Trend Towards Processor/FPGA Programmable Platforms
- FPGAs with hard-core processors (Xilinx Virtex-II Pro, source: Xilinx; Altera Excalibur, source: Altera)
- FPGAs with soft-core processors (Xilinx Spartan, source: Xilinx)
- Computer boards with FPGAs (Cray XD1, source: FPGA Journal, Apr. '05)
Slide 23: Warp Processing -- Trend Towards Processor/FPGA Programmable Platforms
- Programming is a key challenge
  - Soln 1: compile a high-level language to custom binaries using both microprocessor and FPGA
  - Soln 2: use standard microprocessor binaries, and dynamically re-compile ("warp")
- Cons
  - Less high-level information when compiling, less optimization
- Pros
  - Available to all software developers, not just specialists
  - Data-dependent optimization
  - Most importantly, standard binaries enable an ecosystem among tools, architectures, and applications
- The standard-binary (and ecosystem) concept is presently absent in FPGAs and other new programmable platforms
Slide 24: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 25: Warp Processing Steps (On-Chip CAD)
- Profiling and partitioning, decompilation, synthesis, and JIT FPGA compilation (technology mapping, placement, and routing)
Slide 26: Warp Processing -- Profiling and Partitioning
- Applications spend much time in a small amount of code
  - The 90-10 rule
  - Observed: a 75-4 rule for MediaBench, NetBench
- Developed an efficient hardware profiler
  - Gordon-Ross/Vahid, CASES'04, IEEE Trans. on Computers '06
- Partitioning is straightforward
  - Try the most critical code first
Slide 27: Warp Processing -- Decompilation
- Synthesis from a binary has a key challenge
  - High-level information (e.g., loops, arrays) is lost during compilation
  - Direct translation of assembly to a circuit incurs huge overheads
  - Need to recover the high-level information

[Chart: overhead of the microprocessor/FPGA solution WITHOUT decompilation, vs. the microprocessor alone]
Slide 28: Warp Processing -- Decompilation
- Solution: recover high-level information from the binary -- decompilation
- Extensive previous work (for different purposes)
  - Adapted it, and developed new decompilation methods also

Original C code:

  long f( short a[10] ) {
    long accum = 0;
    for (int i=0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding assembly:

    Mov reg3, 0
    Mov reg4, 0
  loop: Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld  reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Slide 29: New Decompilation Method -- Loop Rerolling
- Problem: compiler unrolling of loops (to expose parallelism) causes synthesis problems
  - Huge input (slow); can't unroll to the desired amount; can't use advanced loop methods (loop pipelining, fusion, splitting, ...)
- Solution: a new decompilation method, loop rerolling
  - Identify unrolled iterations, compact them into one iteration

Unrolled loop:

  Ld  reg2, 100(0)
  Add reg1, reg1, reg2
  Ld  reg2, 100(1)
  Add reg1, reg1, reg2
  Ld  reg2, 100(2)
  Add reg1, reg1, reg2

Rerolled:

  for (int i=0; i < 3; i++)
    accum += a[i];
Slide 30: Loop Rerolling -- Identify Unrolled Iterations
- Find consecutively repeating instruction sequences

Original C code:

  x = x + 1;
  for (i=0; i < 2; i++)
    a[i] = b[i] + 1;
  y = x;
Slide 31: Warp Processing -- Decompilation
- Study
  - Synthesis after decompilation is often quite similar to synthesis from source
  - Almost identical performance, small area overhead
- (FPGA 2005)
Slide 32: Deriving High-Level Constructs from Binaries
- Recent study of decompilation robustness
  - In the presence of compiler optimizations, and across instruction sets
  - Energy savings of 77%/76%/87% for MIPS/ARM/MicroBlaze
- (ICCAD'05, DATE'04)
Slide 33: Decompilation is Effective Even with High Compiler-Optimization Levels

[Chart: average speedup of 10 examples]

Publication: "New Decompilation Techniques for Binary-level Co-processor Generation." G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Slide 34: Decompilation Effectiveness -- In-Depth Study
- Performed an in-depth study with Freescale
  - H.264 video decoder
  - Highly-optimized proprietary code, not reference code -- a huge difference
- Research question: is synthesis from binaries competitive on highly-optimized code?
- Several-month study
- H.264 vs. MPEG-2: better quality, or smaller files, using more computation
Slide 35: Optimized H.264
- Larger than most benchmarks
  - H.264: 16,000 lines
  - Previous work: 100 to several thousand lines
- Highly optimized
  - H.264: many man-hours of manual optimization
  - 10x faster than the reference code used in previous works
- Different profiling results
  - Previous examples: 90% of time in several loops
  - H.264: 90% of time in 45 functions
  - Harder to speed up
Slide 36: C vs. Binary Synthesis on Optimized H.264
- Binary partitioning is competitive with source-level partitioning
  - Speedups compared to ARM9 software: binary 2.48, C 2.53
- Decompilation recovered nearly all the high-level information needed for partitioning and synthesis
Slide 37: Warp Processing -- Synthesis
- ROCM -- Riverside On-Chip Minimizer
  - Standard register-transfer synthesis
  - Logic synthesis: make it lean
  - Combination of approaches from Espresso-II [Brayton et al. 1984; Hassoun & Sasao 2002] and Presto [Svoboda & White 1979]
  - Cost/benefit analysis of operations
- Result
  - A single expand phase instead of multiple iterations
  - Eliminates the need to compute the off-set, reducing memory usage
  - On average only 2% larger than the optimal solution

[Toolflow diagram: profiling & partitioning -> decompilation -> synthesis -> standard HW binary -> binary updater -> JIT FPGA compilation; outputs: microprocessor binary + FPGA binary]
Slide 38: Warp Processing -- JIT FPGA Compilation
- Hard: routing is extremely compute/memory intensive
- Solution: jointly design the CAD tools and the FPGA architecture
  - Cost/benefit analysis
  - A highly iterative process

[Same toolflow diagram as slide 37]
Slide 39: Warp-Targeted FPGA Architecture
- CAD-specialized configurable logic fabric
  - Simplified switch matrices
    - Directly connected to the adjacent CLB
    - All nets are routed using only a single pair of channels
    - Allows for efficient routing -- routing is by far the most time-consuming on-chip CAD task
  - Simplified CLBs
    - Two 3-input, 2-output LUTs
    - Each CLB connected to the adjacent CLB, to simplify routing of carry chains
- Currently being prototyped by Intel (scheduled for the 2006 Q3 shuttle)
- (DATE'04)
Slide 40: Warp Processing -- Technology Mapping
- ROCTM -- Technology Mapping/Packing
  - Decompose the hardware circuit into a DAG
    - Nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.)
  - Hierarchical bottom-up graph clustering algorithm
    - Breadth-first traversal, combining nodes to form single-output LUTs
  - Combine LUTs with common inputs to form the final 2-output LUTs
  - Pack LUTs in which the output from one LUT is an input to a second LUT

[Flow: logic synthesis -> tech. mapping/packing -> placement -> routing (JIT FPGA compilation)]

- ("Dynamic Hardware/Software Partitioning: A First Approach," DAC'03; "A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning," DATE'04)
Slide 41: Warp Processing -- Placement
- ROCPLACE -- Placement
  - Dependency-based positional placement algorithm
  - Identify the critical path, placing critical nodes in the center of the CLF
  - Use dependencies between the remaining CLBs to determine placement
  - Attempt to use adjacent-CLB routing whenever possible
- (DAC'03, DATE'04)
Slide 42: Warp Processing -- Routing
- ROCR -- Riverside On-Chip Router
  - Requires much less memory than VPR, as the resource graph is smaller
  - 10x faster execution time than VPR (timing-driven)
  - Produces circuits with a critical path 10% shorter than VPR (routability-driven)
- ("Dynamic FPGA Routing for Just-in-Time FPGA Compilation," DAC'04)
Slide 43: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 44: Experiments with Warp Processing
- Warp processor
  - ARM/MIPS plus our fabric
  - Riverside on-chip CAD tools map the critical region to the configurable fabric
  - Requires less than 2 seconds on a lean embedded processor to perform synthesis and JIT FPGA compilation
- Traditional HW/SW partitioning
  - ARM/MIPS plus Xilinx Virtex-E FPGA
  - Manually partitioned software using VHDL
  - VHDL synthesized using Xilinx ISE 4.1
Slide 45: Warp Processors -- Performance Speedup (Most Frequent Kernel Only)

[Chart: speedup over SW-only execution]
Slide 46: Warp Processors -- Performance Speedup (Overall, Multiple Kernels)
- Assuming a 100 MHz ARM, and the fabric clocked at the rate determined by synthesis
- Energy reduction of 38% - 94%

[Chart: speedup over SW-only execution]
Slide 47: Warp Processors -- Results: Execution Time and Memory Requirements
Slide 48: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 49: Direction -- Coding Guidelines for Partitioning?
- The in-depth H.264 study led to a question: why aren't speedups (from binary or C) closer to ideal (zero time per function)?
- We thus examined dozens of benchmarks in more detail
- Are there simple coding guidelines that result in better speedups when kernels are synthesized to circuits?
Slide 50: Synthesis-Oriented Coding Guidelines
- Pass by value-return
  - Declare a local array and copy in all data needed by a function (makes the lack of aliases explicit)
- Function specialization
  - Create a function version having frequent parameter values as constants

Original:

  void f(int width, int height) {
    . . .
    for (i=0; i < width; i++)
      for (j=0; j < height; j++)
        . . .
    . . .
  }

Rewritten:

  void f_4_4() {
    . . .
    for (i=0; i < 4; i++)
      for (j=0; j < 4; j++)
        . . .
    . . .
  }

Bounds are explicit, so the loops are now unrollable
Slide 51: Synthesis-Oriented Coding Guidelines
- Algorithmic specialization
  - Use parallelizable hardware algorithms when possible
- Hoisting and sinking of error checking
  - Keep error checking out of loops to enable unrolling
- Lookup table avoidance
  - Use expressions rather than lookup tables

Original:

  int clip[512] = { . . . };
  void f() {
    . . .
    for (i=0; i < 10; i++)
      val[i] = clip[val[i]];
    . . .
  }

Rewritten (the comparisons can now be parallelized):

  void f() {
    . . .
    for (i=0; i < 10; i++) {
      if (val[i] > 255) val[i] = 255;
      else if (val[i] < 0) val[i] = 0;
    }
    . . .
  }
Slide 52: Synthesis-Oriented Coding Guidelines
- Use explicit control flow
  - Replace function pointers with if statements and static function calls

Original:

  void (*funcArray[])(char *data) = { func1, func2, . . . };
  void f(char *data) {
    . . .
    funcPointer = funcArray[i];
    (*funcPointer)(data);
    . . .
  }

Rewritten:

  void f(char *data) {
    . . .
    if (i == 0)
      func1(data);
    else if (i == 1)
      func2(data);
    . . .
  }
Slide 53: Coding Guideline Results on H.264
- Simple coding guidelines made a large improvement
- The rewritten software is only 3% slower than the original
- And binary partitioning is still competitive with C partitioning
  - Speedups: binary 6.55, C 6.56
  - The small difference was caused by switch statements that used indirect jumps
Slide 54: Coding Guideline Results on Other Benchmarks
- Studied the guidelines further on standard benchmarks
- Further synthesis speedups (again, independent of the C vs. binary issue)
- More guidelines remain to be developed
- As compute platforms incorporate FPGAs, might these guidelines become mainstream?
Slide 55: Direction -- New Applications: Image Processing
- 32x average speedup compared to a µP with a 10x faster clock
- Exploits parallelism in image processing
  - Window operations contain much fine-grained parallelism
  - And each pixel can be computed in parallel
- Performance is memory-bandwidth limited
  - Warp processing can output a pixel per cycle for each pixel that can be fetched from memory per cycle
  - Faster memory will further improve performance
Slide 56: Direction -- Applications with Process-Level Parallelism
- Parallel code provides further speedup
  - Average 79x speedup compared to a desktop µP
- Use the FPGA to implement 10s or 100s of processors
  - Can also exploit instruction-level parallelism
- Warp tools will have to detect coarse-grained parallelism
Slide 57: Summary
- Showed the feasibility of warp technology
  - Application kernels can be dynamically mapped to an FPGA by a reasonable amount of on-chip compute resources
- Tremendous potential applicability
- Presently investigating
  - Embedded (w/ Freescale)
  - Desktop (w/ Intel)
  - Server (w/ IBM)
- Radically new FPGA apps may be possible
  - Neural networks that rewire themselves? Network routers whose queuing structure changes based on traffic patterns?
- If the technology exists to synthesize circuits dynamically, what can we do with that technology?