Title: Warp Processing -- Dynamic Transparent Conversion of Binaries to Circuits
Slide 1: Warp Processing -- Dynamic, Transparent Conversion of Binaries to Circuits
- Frank Vahid
- Professor, Department of Computer Science and Engineering, University of California, Riverside
- Associate Director, Center for Embedded Computer Systems, UC Irvine
- Work supported by the National Science Foundation, the Semiconductor Research Corporation, Xilinx, Intel, Motorola/Freescale
- Contributing students: Roman Lysecky (PhD 2005, now asst. prof. at U. Arizona), Greg Stitt (PhD 2006), Kris Miller (MS 2007), David Sheldon (3rd-yr PhD), Ryan Mannion (2nd-yr PhD), Scott Sirowy (1st-yr PhD)
Slide 2: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 3: FPGAs
- FPGA -- Field-Programmable Gate Array
  - Off-the-shelf chip, evolved in the early 1990s
  - Implements a custom circuit just by downloading a stream of bits (software)
- Basic idea: a memory with 2^N addresses can implement any N-input combinational logic function
  - (Note: no gate array inside)
  - The memory is called a Lookup Table, or LUT
- FPGA fabric
  - Thousands of small (3-input) LUTs; larger LUTs are inefficient
  - Thousands of switch matrices (SM) for programming interconnections
  - Possibly additional hard-core components, like multipliers, RAM, etc.
  - CAD tools automatically map a desired circuit onto the FPGA fabric
Slide 4: FPGAs "Programmable" like Microprocessors -- Download Bits
Slide 5: FPGAs as Coprocessors
- Coprocessor -- accelerates an application kernel by implementing it as a circuit
- ASIC coprocessors are known to speed up many application kernels
  - Energy advantages too (e.g., Henkel '98, Rabaey '98, Stitt/Vahid '04)
- FPGA coprocessors also give speedup/energy benefits (Stitt/Vahid IEEE D&T '02, IEEE TECS '04)
  - Con: more silicon (~20x), ~4x performance overhead (Rose FPGA'06)
  - Pro: platform fully programmable
    - Shorter time-to-market, smaller non-recurring engineering (NRE) cost, low-cost devices available, late changes (even in-product)

[Diagrams: application running on processor + ASIC, and on processor + FPGA]
Slide 6: FPGAs as Coprocessors -- Surprisingly Competitive with ASICs
- FPGA: 34% energy savings, versus the ASIC's 48% (Stitt/Vahid IEEE D&T '02, IEEE TECS '04)
- A jet isn't as fast as a rocket, but it sure beats driving
Slide 7: FPGA -- Why (Sometimes) Better than a Microprocessor

C code for bit reversal:

  x = (x >> 16)              | (x << 16);
  x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
  x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
  x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
  x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);

In general, because of concurrency, from the bit level to the task level
Slide 8: FPGAs -- Why (Sometimes) Better than a Microprocessor

Hardware for an FIR filter, vs. C code for an FIR filter:

  for (i=0; i < 128; i++)
    y[i] += c[i] * x[i];
  . . .

- On a microprocessor: 1000s of instructions, several thousand cycles
Slide 9: FPGAs are Hard to Program
- Synthesis from hardware description languages (HDLs)
  - VHDL, Verilog
  - Great for parallelism
  - But non-standard languages, manual partitioning
  - SystemC a good step
- C/C++ partitioning compilers
  - Use a language subset
  - Growing in importance
  - But a special compiler limits adoption
- The flow includes synthesis, technology mapping, place & route
- ~100 software writers for every CAD user: only about 15,000 CAD seats worldwide, versus millions of compiler seats

[Diagram: processor alongside FPGA]
Slide 10: Binary-Level Partitioning Helps
- Binary-level partitioning
  - Stitt/Vahid, ICCAD'02
  - Recent commercial product: CriticalBlue (www.criticalblue.com)
  - Partition and synthesize starting from the SW binary
- Advantages
  - Any compiler, any language, multiple sources, assembly/object support, legacy code support
  - Better incorporation into the toolflow: a less disruptive, back-end tool
- Disadvantage
  - Quality loss due to lack of high-level language constructs? (More later)

[Toolflow diagram: traditional partitioning is done before compilation; the binary-level step comes after, and includes synthesis, technology mapping, place & route]
Slide 11: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 12: Warp Processing
- Observation: dynamic binary recompilation to a different microprocessor architecture is a mature commercial technology
  - e.g., modern Pentiums translate x86 to VLIW
- Question: if we can statically recompile binaries to FPGA circuits, can we dynamically recompile binaries to FPGA circuits?
Slide 13: Warp Processing Idea
1. Initially, the software binary is loaded into instruction memory

[Architecture diagram, repeated on the following slides: profiler, instruction memory (I Mem), microprocessor (µP), data cache (D$), FPGA, on-chip CAD]
Slide 14: Warp Processing Idea
2. The microprocessor executes the instructions in the software binary
Slide 15: Warp Processing Idea
3. The profiler monitors instructions and detects critical regions in the binary (shown in the diagram as a hot loop of beq/add instructions)
Slide 16: Warp Processing Idea
4. On-chip CAD reads in the critical region
Slide 17: Warp Processing Idea
5. On-chip CAD decompiles the critical region into a control/data flow graph (CDFG)
Slide 18: Warp Processing Idea
6. On-chip CAD synthesizes the decompiled CDFG into a custom (parallel) circuit
Slide 19: Warp Processing Idea
7. On-chip CAD maps the circuit onto the FPGA
Slide 20: Warp Processing Idea
8. On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more

Updated binary:

  Mov reg3, 0
  Mov reg4, 0
  loop: // instructions that interact with FPGA
  Ret reg4
Slide 21: Warp Processing Idea
- Likely multiple microprocessors per chip, serviced by one on-chip CAD block

[Diagram: several µPs sharing one profiler, FPGA, and on-chip CAD block]
Slide 22: Warp Processing -- Trend Towards Processor/FPGA Programmable Platforms
- FPGAs with hard-core processors (Xilinx Virtex-II Pro, source: Xilinx; Altera Excalibur, source: Altera)
- FPGAs with soft-core processors (Xilinx Spartan, source: Xilinx)
- Computer boards with FPGAs (Cray XD1, source: FPGA Journal, Apr. '05)
Slide 23: Warp Processing -- Trend Towards Processor/FPGA Programmable Platforms
- Programming is a key challenge
  - Soln 1: compile a high-level language to custom binaries using both microprocessor and FPGA
  - Soln 2: use standard microprocessor binaries, and dynamically re-compile ("warp")
- Cons
  - Less high-level information when compiling, less optimization
- Pros
  - Available to all software developers, not just specialists
  - Data-dependent optimization
  - Most importantly, standard binaries enable an ecosystem among tools, architectures, and applications
- The standard-binary (and ecosystem) concept is presently absent in FPGAs and other new programmable platforms
Slide 24: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 25: Warp Processing Steps (On-Chip CAD)
- Profiling and partitioning, decompilation, synthesis, and JIT FPGA compilation (technology mapping, placement, and routing)
Slide 26: Warp Processing -- Profiling and Partitioning
- Applications spend much time in a small amount of code
  - The 90-10 rule
  - Observed: a 75-4 rule for MediaBench, NetBench
- Developed an efficient hardware profiler
  - Gordon-Ross/Vahid, CASES'04, IEEE Trans. on Computers '06
- Partitioning is straightforward
  - Try the most critical code first
Slide 27: Warp Processing -- Decompilation
- Synthesis from a binary has a key challenge
  - High-level information (e.g., loops, arrays) is lost during compilation
  - Direct translation of assembly to a circuit incurs huge overheads
  - Need to recover the high-level information

[Chart: overhead of the microprocessor/FPGA solution WITHOUT decompilation, vs. the microprocessor alone]
Slide 28: Warp Processing -- Decompilation
- Solution: recover high-level information from the binary -- decompilation
- Extensive previous work (for different purposes)
  - Adapted it, and developed new decompilation methods also

Original C code:

  long f( short a[10] ) {
    long accum = 0;
    for (int i=0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding assembly:

    Mov reg3, 0
    Mov reg4, 0
  loop: Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld  reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Slide 29: New Decompilation Method -- Loop Rerolling
- Problem: compiler unrolling of loops (to expose parallelism) causes synthesis problems
  - Huge input (slow); can't unroll to the desired amount; can't use advanced loop methods (loop pipelining, fusion, splitting, ...)
- Solution: a new decompilation method, loop rerolling
  - Identify unrolled iterations, compact them into one iteration

Unrolled loop:

  Ld  reg2, 100(0)
  Add reg1, reg1, reg2
  Ld  reg2, 100(1)
  Add reg1, reg1, reg2
  Ld  reg2, 100(2)
  Add reg1, reg1, reg2

Rerolled:

  for (int i=0; i < 3; i++)
    accum += a[i];
Slide 30: Loop Rerolling -- Identify Unrolled Iterations
- Find consecutively repeating instruction sequences

Original C code:

  x = x + 1;
  for (i=0; i < 2; i++)
    a[i] = b[i] + 1;
  y = x;
Slide 31: Warp Processing -- Decompilation
- Study
  - Synthesis after decompilation is often quite similar to synthesis from source
  - Almost identical performance, small area overhead
- (FPGA 2005)
Slide 32: Deriving High-Level Constructs from Binaries
- Recent study of decompilation robustness
  - In the presence of compiler optimizations, and across instruction sets
  - Energy savings of 77%/76%/87% for MIPS/ARM/MicroBlaze
- (ICCAD'05, DATE'04)
Slide 33: Decompilation is Effective Even with High Compiler-Optimization Levels

[Chart: average speedup of 10 examples]

Publication: "New Decompilation Techniques for Binary-level Co-processor Generation." G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov. 2005.
Slide 34: Decompilation Effectiveness -- In-Depth Study
- Performed an in-depth study with Freescale
  - H.264 video decoder
  - Highly-optimized proprietary code, not reference code -- a huge difference
- Research question: is synthesis from binaries competitive on highly-optimized code?
- Several-month study
- H.264 vs. MPEG-2: better quality, or smaller files, using more computation
Slide 35: Optimized H.264
- Larger than most benchmarks
  - H.264: 16,000 lines
  - Previous work: 100 to several thousand lines
- Highly optimized
  - H.264: many man-hours of manual optimization
  - 10x faster than the reference code used in previous works
- Different profiling results
  - Previous examples: 90% of time in several loops
  - H.264: 90% of time in 45 functions
  - Harder to speed up
Slide 36: C vs. Binary Synthesis on Optimized H.264
- Binary partitioning is competitive with source-level partitioning
  - Speedups compared to ARM9 software: binary 2.48, C 2.53
- Decompilation recovered nearly all the high-level information needed for partitioning and synthesis
Slide 37: Warp Processing -- Synthesis
- ROCM -- Riverside On-Chip Minimizer
  - Standard register-transfer synthesis
  - Logic synthesis: make it lean
  - Combination of approaches from Espresso-II [Brayton et al. 1984; Hassoun & Sasao 2002] and Presto [Svoboda & White 1979]
  - Cost/benefit analysis of operations
- Result
  - A single expand phase instead of multiple iterations
  - Eliminates the need to compute the off-set, reducing memory usage
  - On average only 2% larger than the optimal solution

[Toolflow diagram: profiling & partitioning -> decompilation -> synthesis -> standard HW binary -> binary updater -> JIT FPGA compilation; outputs: microprocessor binary + FPGA binary]
Slide 38: Warp Processing -- JIT FPGA Compilation
- Hard: routing is extremely compute/memory intensive
- Solution: jointly design the CAD tools and the FPGA architecture
  - Cost/benefit analysis
  - A highly iterative process

[Same toolflow diagram as slide 37]
Slide 39: Warp-Targeted FPGA Architecture
- CAD-specialized configurable logic fabric
  - Simplified switch matrices
    - Directly connected to the adjacent CLB
    - All nets are routed using only a single pair of channels
    - Allows for efficient routing -- routing is by far the most time-consuming on-chip CAD task
  - Simplified CLBs
    - Two 3-input, 2-output LUTs
    - Each CLB connected to the adjacent CLB, to simplify routing of carry chains
- Currently being prototyped by Intel (scheduled for the 2006 Q3 shuttle)
- (DATE'04)
Slide 40: Warp Processing -- Technology Mapping
- ROCTM -- Technology Mapping/Packing
  - Decompose the hardware circuit into a DAG
    - Nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.)
  - Hierarchical bottom-up graph clustering algorithm
    - Breadth-first traversal, combining nodes to form single-output LUTs
  - Combine LUTs with common inputs to form the final 2-output LUTs
  - Pack LUTs in which the output from one LUT is an input to a second LUT

[Flow: logic synthesis -> tech. mapping/packing -> placement -> routing (JIT FPGA compilation)]

- ("Dynamic Hardware/Software Partitioning: A First Approach," DAC'03; "A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning," DATE'04)
Slide 41: Warp Processing -- Placement
- ROCPLACE -- Placement
  - Dependency-based positional placement algorithm
  - Identify the critical path, placing critical nodes in the center of the CLF
  - Use dependencies between the remaining CLBs to determine placement
  - Attempt to use adjacent-CLB routing whenever possible
- (DAC'03, DATE'04)
Slide 42: Warp Processing -- Routing
- ROCR -- Riverside On-Chip Router
  - Requires much less memory than VPR, as the resource graph is smaller
  - 10x faster execution time than VPR (timing-driven)
  - Produces circuits with a critical path 10% shorter than VPR (routability-driven)
- ("Dynamic FPGA Routing for Just-in-Time FPGA Compilation," DAC'04)
Slide 43: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 44: Experiments with Warp Processing
- Warp processor
  - ARM/MIPS plus our fabric
  - Riverside on-chip CAD tools map the critical region to the configurable fabric
  - Requires less than 2 seconds on a lean embedded processor to perform synthesis and JIT FPGA compilation
- Traditional HW/SW partitioning
  - ARM/MIPS plus Xilinx Virtex-E FPGA
  - Manually partitioned software using VHDL
  - VHDL synthesized using Xilinx ISE 4.1
Slide 45: Warp Processors -- Performance Speedup (Most Frequent Kernel Only)

[Chart: speedup over SW-only execution]
Slide 46: Warp Processors -- Performance Speedup (Overall, Multiple Kernels)
- Assuming a 100 MHz ARM, and the fabric clocked at the rate determined by synthesis
- Energy reduction of 38% - 94%

[Chart: speedup over SW-only execution]
Slide 47: Warp Processors -- Results: Execution Time and Memory Requirements
Slide 48: Outline
- FPGAs
  - Overview
  - Hard to program -> binary-level partitioning
- Warp processing
- Techniques underlying warp processing
- Overall warp processing results
- Directions and summary
Slide 49: Direction -- Coding Guidelines for Partitioning?
- The in-depth H.264 study led to a question: why aren't speedups (from binary or C) closer to ideal (zero time per function)?
- We thus examined dozens of benchmarks in more detail
- Are there simple coding guidelines that result in better speedups when kernels are synthesized to circuits?
Slide 50: Synthesis-Oriented Coding Guidelines
- Pass by value-return
  - Declare a local array and copy in all data needed by a function (makes the lack of aliases explicit)
- Function specialization
  - Create a function version having frequent parameter values as constants

Original:

  void f(int width, int height) {
    . . .
    for (i=0; i < width; i++)
      for (j=0; j < height; j++)
        . . .
    . . .
  }

Rewritten:

  void f_4_4() {
    . . .
    for (i=0; i < 4; i++)
      for (j=0; j < 4; j++)
        . . .
    . . .
  }

Bounds are explicit, so the loops are now unrollable
Slide 51: Synthesis-Oriented Coding Guidelines
- Algorithmic specialization
  - Use parallelizable hardware algorithms when possible
- Hoisting and sinking of error checking
  - Keep error checking out of loops to enable unrolling
- Lookup table avoidance
  - Use expressions rather than lookup tables

Original:

  int clip[512] = { . . . };
  void f() {
    . . .
    for (i=0; i < 10; i++)
      val[i] = clip[val[i]];
    . . .
  }

Rewritten (the comparisons can now be parallelized):

  void f() {
    . . .
    for (i=0; i < 10; i++) {
      if (val[i] > 255) val[i] = 255;
      else if (val[i] < 0) val[i] = 0;
    }
    . . .
  }
Slide 52: Synthesis-Oriented Coding Guidelines
- Use explicit control flow
  - Replace function pointers with if statements and static function calls

Original:

  void (*funcArray[])(char *data) = { func1, func2, . . . };
  void f(char *data) {
    . . .
    funcPointer = funcArray[i];
    (*funcPointer)(data);
    . . .
  }

Rewritten:

  void f(char *data) {
    . . .
    if (i == 0)
      func1(data);
    else if (i == 1)
      func2(data);
    . . .
  }
Slide 53: Coding Guideline Results on H.264
- Simple coding guidelines made a large improvement
- The rewritten software is only 3% slower than the original
- And binary partitioning is still competitive with C partitioning
  - Speedups: binary 6.55, C 6.56
  - The small difference was caused by switch statements that used indirect jumps
Slide 54: Coding Guideline Results on Other Benchmarks
- Studied the guidelines further on standard benchmarks
- Further synthesis speedups (again, independent of the C vs. binary issue)
- More guidelines remain to be developed
- As compute platforms incorporate FPGAs, might these guidelines become mainstream?
Slide 55: Direction -- New Applications: Image Processing
- 32x average speedup compared to a µP with a 10x faster clock
- Exploits parallelism in image processing
  - Window operations contain much fine-grained parallelism
  - And each pixel can be computed in parallel
- Performance is memory-bandwidth limited
  - Warp processing can output a pixel per cycle for each pixel that can be fetched from memory per cycle
  - Faster memory will further improve performance
Slide 56: Direction -- Applications with Process-Level Parallelism
- Parallel code provides further speedup
  - Average 79x speedup compared to a desktop µP
- Use the FPGA to implement 10s or 100s of processors
  - Can also exploit instruction-level parallelism
- Warp tools will have to detect coarse-grained parallelism
Slide 57: Summary
- Showed the feasibility of warp technology
  - Application kernels can be dynamically mapped to an FPGA by a reasonable amount of on-chip compute resources
- Tremendous potential applicability
- Presently investigating
  - Embedded (w/ Freescale)
  - Desktop (w/ Intel)
  - Server (w/ IBM)
- Radically new FPGA apps may be possible
  - Neural networks that rewire themselves? Network routers whose queuing structure changes based on traffic patterns?
- If the technology exists to synthesize circuits dynamically, what can we do with that technology?