Title: Warp Processing -- Dynamic Transparent Conversion of Binaries to Circuits


1
Warp Processing -- Dynamic Transparent Conversion
of Binaries to Circuits
  • Frank Vahid
  • Professor
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Associate Director, Center for Embedded Computer
    Systems, UC Irvine
  • Work supported by the National Science
    Foundation, the Semiconductor Research
    Corporation, Xilinx, Intel, Motorola/Freescale
  • Contributing Students: Roman Lysecky (PhD 2005,
    now asst. prof. at U. Arizona), Greg Stitt (PhD
    2006), Kris Miller (MS 2007), David Sheldon (3rd
    yr PhD), Ryan Mannion (2nd yr PhD), Scott Sirowy
    (1st yr PhD)

2
Outline
  • FPGAs
  • Overview
  • Hard to program --> Binary-level partitioning
  • Warp processing
  • Techniques underlying warp processing
  • Overall warp processing results
  • Directions and Summary

3
FPGAs
  • FPGA -- Field-Programmable Gate Array
  • Off-the-shelf chip, evolved in early 1990s
  • Implements custom circuit just by downloading
    stream of bits (software)
  • Basic idea: an N-address memory can implement any
    N-input combinational logic function (a small sketch
    follows this list)
  • (Note: there is no gate array inside)
  • The memory is called a lookup table, or LUT
  • FPGA fabric
  • Thousands of small (3-input) LUTs; larger LUTs
    are inefficient
  • Thousands of switch matrices (SM) for programming
    interconnections
  • Possibly additional hard core components, like
    multipliers, RAM, etc.
  • CAD tools automatically map desired circuit onto
    FPGA fabric
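As a minimal sketch of the LUT idea only (a hypothetical 3-input
function; not any particular FPGA's configuration format), an
N-input function can be stored as a 2^N-entry truth table and
evaluated by using the inputs as the memory address:

/* Minimal sketch of the LUT idea: an N-input combinational function stored as
 * a 2^N-entry truth table in a small memory, "programmed" just by loading bits.
 * Hypothetical 3-input example; not a particular FPGA's configuration format. */
#include <stdio.h>

/* Truth table for f(a,b,c) = (a AND b) XOR c, indexed by the 3-bit address cba. */
static const unsigned char lut3[8] = { 0, 0, 0, 1, 1, 1, 1, 0 };

static int lut_eval(const unsigned char *lut, int a, int b, int c) {
    int addr = (c << 2) | (b << 1) | a;   /* the inputs form the memory address */
    return lut[addr];
}

int main(void) {
    for (int c = 0; c <= 1; c++)
        for (int b = 0; b <= 1; b++)
            for (int a = 0; a <= 1; a++)
                printf("f(%d,%d,%d) = %d\n", a, b, c, lut_eval(lut3, a, b, c));
    return 0;
}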

4
FPGAs "Programmable" like Microprocessors --
Download Bits
5
FPGAs as Coprocessors
  • Coprocessor -- accelerates an application kernel by
    implementing it as a circuit
  • ASIC coprocessors are known to speed up many
    application kernels
  • Energy advantages too (e.g., Henkel98,
    Rabaey98, Stitt/Vahid04)
  • FPGA coprocessors also give speedup/energy
    benefits (Stitt/Vahid IEEE D&T'02, IEEE TECS'04)
  • Con: more silicon (20x), 4x performance
    overhead (Rose FPGA'06)
  • Pro: platform fully programmable
  • Shorter time-to-market, smaller non-recurring
    engineering (NRE) cost, low-cost devices
    available, late changes (even in-product)

[Diagram: application running on a processor with an ASIC
coprocessor, vs. the same application on a processor with an
FPGA coprocessor]
6
FPGAs as Coprocessors: Surprisingly Competitive with ASICs
  • FPGA: 34% energy savings, versus 48% for an ASIC
    (Stitt/Vahid IEEE D&T'02, IEEE TECS'04)
  • A jet isn't as fast as a rocket, but it sure
    beats driving

7
FPGA: Why (Sometimes) Better than a Microprocessor
C Code for Bit Reversal
x = (x >> 16)               | (x << 16);
x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
In general, the advantage comes from concurrency, from the
bit level up to the task level
8
FPGAs: Why (Sometimes) Better than a Microprocessor --
Hardware for FIR Filter
C Code for FIR Filter
for (i=0; i < 128; i++)
  y[i] += c[i] * x[i];
  • Hardware: 7 cycles, speedup > 100x (a parallel datapath
    sketch follows this list)
  • Microprocessor: 1000s of instructions,
    several thousand cycles
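A minimal sketch of why the circuit needs so few cycles, using an
illustrative 8-tap fragment (assumed sizes; not the actual
synthesized datapath): all taps multiply in parallel and a
log-depth adder tree sums the products.

/* Minimal sketch (illustrative 8-tap fragment, not the synthesized datapath) of
 * the FIR hardware idea: every multiply is independent, and the sum is formed
 * by a log-depth adder tree, so in a circuit the whole body takes a few cycles. */
#include <stdio.h>

int fir8(const int c[8], const int x[8]) {
    int p[8], s[4], t[2];

    for (int k = 0; k < 8; k++)        /* in hardware: 8 multipliers in parallel  */
        p[k] = c[k] * x[k];

    for (int k = 0; k < 4; k++)        /* adder-tree level 1: 4 adders in parallel */
        s[k] = p[2*k] + p[2*k + 1];
    for (int k = 0; k < 2; k++)        /* adder-tree level 2: 2 adders in parallel */
        t[k] = s[2*k] + s[2*k + 1];

    return t[0] + t[1];                /* adder-tree level 3: final adder */
}

int main(void) {
    int c[8] = { 1, 2, 3, 4, 4, 3, 2, 1 };
    int x[8] = { 1, 1, 1, 1, 1, 1, 1, 1 };
    printf("fir8 = %d\n", fir8(c, x));  /* prints 20 */
    return 0;
}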

9
FPGAs are Hard to Program
  • Synthesis from hardware description languages
    (HDLs)
  • VHDL, Verilog
  • Great for parallelism
  • But non-standard languages, manual partitioning
  • SystemC a good step
  • C/C++ partitioning compilers
  • Use language subset
  • Growing in importance
  • But special compiler limits adoption

Includes synthesis, tech. map, place & route
100 software writers for every CAD user. Only
about 15,000 CAD seats worldwide; millions of
compiler seats
10
Binary-Level Partitioning Helps
  • Binary-level partitioning
  • Stitt/Vahid, ICCAD02
  • Recent commercial product: Critical Blue,
    www.criticalblue.com
  • Partition and synthesize starting from SW binary
  • Advantages
  • Any compiler, any language, multiple sources,
    assembly/object support, legacy code support
  • Better incorporation into toolflow
  • Disadvantage
  • Quality loss due to lack of high-level language
    constructs? (More later)

Traditional partitioning done here
Includes synthesis, tech. map, place & route
Less disruptive, back-end tool
11
Outline
  • FPGAs
  • Overview
  • Hard to program --> Binary-level partitioning
  • Warp processing
  • Techniques underlying warp processing
  • Overall warp processing results
  • Directions and Summary

12
Warp Processing
  • Observation: Dynamic binary recompilation to a
    different microprocessor architecture is a mature
    commercial technology
  • e.g., Modern Pentiums translate x86 to VLIW

Question: If we can recompile binaries to FPGA circuits
statically (binary-level partitioning), can we dynamically
recompile binaries to FPGA circuits?
13
Warp Processing Idea
Step 1: Initially, the software binary is loaded into
instruction memory.
[Diagram: Profiler, I Mem, D$, µP, FPGA, On-chip CAD]
14
Warp Processing Idea
Step 2: The microprocessor executes the instructions in the
software binary.
15
Warp Processing Idea
Step 3: The profiler monitors instructions and detects
critical regions in the binary (e.g., a short, frequently
executed loop of add/beq instructions).
16
Warp Processing Idea
Step 4: The on-chip CAD module reads in the critical region.
17
Warp Processing Idea
Step 5: The on-chip CAD module decompiles the critical region
into a control/data flow graph (CDFG). (In the diagram, the
on-chip CAD block is also labeled Dynamic Partitioning Module,
or DPM.)
18
Warp Processing Idea
Step 6: The on-chip CAD module synthesizes the decompiled CDFG
into a custom (parallel) circuit.
19
Warp Processing Idea
Step 7: The on-chip CAD module maps the circuit onto the FPGA.


20
Warp Processing Idea
Step 8: The on-chip CAD module replaces instructions in the
binary so they use the hardware, causing performance and
energy to "warp" by an order of magnitude or more.
Updated binary (sketch):
  Mov reg3, 0
  Mov reg4, 0
  loop:
    // instructions that interact with the FPGA
  Ret reg4
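Taken together, the eight steps amount to a small service loop
running next to the processor. A minimal sketch of that flow
(placeholder type and function names with empty stubs; not the
actual on-chip CAD tools):

/* Minimal sketch (placeholder names and empty stubs, not the actual on-chip
 * CAD tools) of the eight-step warp flow described above. */
#include <stdio.h>

typedef void *Region, *Cdfg, *Circuit, *Bitstream;

static Region    profiler_wait_for_hot_region(void)  { return "hot loop"; }   /* steps 2-3 */
static Cdfg      decompile(Region r)                 { (void)r; return "cdfg"; }    /* step 5 */
static Circuit   synthesize(Cdfg g)                  { (void)g; return "netlist"; } /* step 6 */
static Bitstream tech_map_place_route(Circuit c)     { (void)c; return "bits"; }    /* step 7 */
static void      configure_fpga(Bitstream b)         { (void)b; }                   /* step 7 */
static void      patch_binary_to_use_fpga(Region r)  { (void)r; }                   /* step 8 */

int main(void) {
    Region hot = profiler_wait_for_hot_region();  /* profiler finds a critical region */
    Cdfg g = decompile(hot);                      /* recover loops, arrays, ...       */
    Circuit c = synthesize(g);                    /* build a parallel custom circuit  */
    Bitstream b = tech_map_place_route(c);        /* JIT FPGA compilation             */
    configure_fpga(b);
    patch_binary_to_use_fpga(hot);                /* execution "warps" to hardware    */
    printf("critical region now executes on the FPGA\n");
    return 0;
}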


21
Warp Processing Idea
Likely multiple microprocessors per chip, serviced by one
on-chip CAD block.
[Diagram: several µPs sharing the profiler, I Mem, D$, FPGA,
and on-chip CAD block]
22
Warp Processing: Trend Towards Processor/FPGA
Programmable Platforms
  • FPGAs with hard core processors
  • FPGAs with soft core processors
  • Computer boards with FPGAs

[Images: Xilinx Virtex-II Pro (source: Xilinx); Altera
Excalibur (source: Altera); Xilinx Spartan (source: Xilinx);
Cray XD1 (source: FPGA Journal, Apr. 2005)]
23
Warp Processing: Trend Towards Processor/FPGA
Programmable Platforms
  • Programming is a key challenge
  • Soln 1: Compile a high-level language to custom
    binaries using both the microprocessor and FPGA
  • Soln 2: Use standard microprocessor binaries, and
    dynamically re-compile (warp)
  • Cons
  • Less high-level information when compiling, less
    optimization
  • Pros
  • Available to all software developers, not just
    specialists
  • Data dependent optimization
  • Most importantly, standard binaries enable
    ecosystem among tools, architecture, and
    applications

Standard binary (and ecosystem) concept presently
absent in FPGAs and other new programmable
platforms
24
Outline
  • FPGAs
  • Overview
  • Hard to program --> Binary-level partitioning
  • Warp processing
  • Techniques underlying warp processing
  • Overall warp processing results
  • Directions and Summary

25
Warp Processing Steps (On-Chip CAD)
[Flow: profiling & partitioning --> decompilation --> synthesis
--> JIT FPGA compilation (technology mapping, placement, and
routing) --> binary updater]
26
Warp Processing: Profiling and Partitioning
  • Applications spend much of their time in a small amount
    of code
  • The 90-10 rule (90% of execution time in 10% of the code)
  • Observed a 75-4 rule for MediaBench, NetBench
  • Developed an efficient hardware profiler (a software model
    of the idea is sketched after this list)
  • Gordon-Ross/Vahid, CASES'04, IEEE Trans. on Comp.
    '06
  • Partitioning is straightforward
  • Try the most critical code first
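A minimal software model of the profiling idea (assumed
structure; not the actual non-intrusive hardware profiler from
CASES'04): watch branches as they execute and keep a small table
of saturating counters indexed by backward-branch target, so hot
loops stand out.

/* Minimal software model (assumed structure, not the actual hardware profiler)
 * of frequent-loop detection: backward branches mark loops, and a small
 * direct-mapped table of saturating counters tracks how often each target runs. */
#include <stdio.h>

#define TABLE_SIZE 16
#define SAT_MAX    65535u

static unsigned int   addr_tag[TABLE_SIZE];
static unsigned short count[TABLE_SIZE];

static void profile_branch(unsigned int pc, unsigned int target) {
    if (target >= pc) return;                /* only backward branches (loops)  */
    int idx = (target >> 2) % TABLE_SIZE;    /* simple direct-mapped table      */
    if (addr_tag[idx] != target) {           /* a new loop evicts the old entry */
        addr_tag[idx] = target;
        count[idx] = 0;
    }
    if (count[idx] < SAT_MAX) count[idx]++;  /* saturating counter */
}

int main(void) {
    /* Pretend a loop at address 0x100 branches back from 0x140 many times,
       while a loop at 0x1F0 is taken only once. */
    for (int i = 0; i < 1000; i++) profile_branch(0x140, 0x100);
    profile_branch(0x230, 0x1F0);
    for (int i = 0; i < TABLE_SIZE; i++)
        if (count[i])
            printf("loop at 0x%x: %u executions\n", addr_tag[i], (unsigned)count[i]);
    return 0;
}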

27
Warp Processing: Decompilation
  • Synthesis from binary has a key challenge
  • High-level information (e.g., loops, arrays) lost
    during compilation
  • Direct translation of assembly to a circuit incurs huge
    overheads
  • Need to recover high-level information

Overhead of microprocessor/FPGA solution WITHOUT
decompilation, vs. microprocessor alone
28
Warp Processing: Decompilation
  • Solution: Recover high-level information from the
    binary via decompilation
  • Extensive previous work exists (for different purposes)
  • Adapted it
  • Also developed new decompilation methods

Original C Code:
  long f( short a[10] ) {
    long accum;
    for (int i=0; i < 10; i++) {
      accum += a[i];
    }
    return accum;
  }

Corresponding Assembly:
  Mov reg3, 0
  Mov reg4, 0
  loop:
  Shl reg1, reg3, 1
  Add reg5, reg2, reg1
  Ld  reg6, 0(reg5)
  Add reg4, reg4, reg6
  Add reg3, reg3, 1
  Beq reg3, 10, -5
  Ret reg4
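A minimal sketch of what the decompiled loop body might look
like once it is held as a data-flow graph (assumed data
structures; not the actual warp-processor CAD code), built by
hand for the accum += a[i] body above:

/* Minimal sketch (assumed data structures, not the actual warp-processor CAD
 * code) of a data-flow-graph node, with the loop body of the example above --
 * accum += a[i] -- built by hand from the decompiled instructions. */
#include <stdio.h>

typedef enum { N_CONST, N_REG, N_SHL, N_ADD, N_LOAD } Kind;

typedef struct Node {
    Kind kind;
    const char *name;        /* register or symbol name, for printing */
    struct Node *in0, *in1;  /* data-flow edges (operand nodes)       */
} Node;

static Node *node(Kind k, const char *nm, Node *a, Node *b) {
    static Node pool[32];
    static int used = 0;
    Node *n = &pool[used++];
    n->kind = k; n->name = nm; n->in0 = a; n->in1 = b;
    return n;
}

static void dump(const Node *n, int depth) {
    for (int i = 0; i < depth; i++) printf("  ");
    printf("%s\n", n->name);
    if (n->in0) dump(n->in0, depth + 1);
    if (n->in1) dump(n->in1, depth + 1);
}

int main(void) {
    /* reg2 = base of a[], reg3 = i, reg4 = accum (see the assembly above) */
    Node *i     = node(N_REG,   "i (reg3)",     NULL, NULL);
    Node *base  = node(N_REG,   "&a[0] (reg2)", NULL, NULL);
    Node *accum = node(N_REG,   "accum (reg4)", NULL, NULL);
    Node *one   = node(N_CONST, "1",            NULL, NULL);

    Node *ofs  = node(N_SHL,  "shl: i*2",          i,     one);  /* Shl reg1, reg3, 1    */
    Node *addr = node(N_ADD,  "add: &a[i]",        base,  ofs);  /* Add reg5, reg2, reg1 */
    Node *elem = node(N_LOAD, "load: a[i]",        addr,  NULL); /* Ld  reg6, 0(reg5)    */
    Node *sum  = node(N_ADD,  "add: accum + a[i]", accum, elem); /* Add reg4, reg4, reg6 */

    dump(sum, 0);   /* print the data-flow tree rooted at the new accum value */
    return 0;
}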
29
New Decompilation Method: Loop Rerolling
  • Problem: Compiler unrolling of loops (to expose
    parallelism) causes synthesis problems
  • Huge input (slow), can't unroll to the desired
    amount, can't use advanced loop methods (loop
    pipelining, fusion, splitting, ...)
  • Solution: A new decompilation method, loop
    rerolling
  • Identify unrolled iterations, compact them into one
    iteration

Loop Unrolling example:

  for (int i=0; i < 3; i++)
    accum += a[i];

  Unrolled binary:
    Ld  reg2, 100(0)
    Add reg1, reg1, reg2
    Ld  reg2, 100(1)
    Add reg1, reg1, reg2
    Ld  reg2, 100(2)
    Add reg1, reg1, reg2
30
Loop Rerolling: Identify Unrolled Iterations
  • Find consecutively repeating instruction sequences (a
    small detection sketch follows the example below)
Original C Code:
  x = x + 1;
  for (i=0; i < 2; i++)
    a[i] = b[i] + 1;
  y = x;
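A minimal sketch of the detection step (assumed string-based
instruction representation with per-iteration constants
normalized to '#'; not the actual rerolling algorithm): compare
each window of k instructions against the k instructions that
immediately follow it.

/* Minimal sketch (assumed representation, not the actual rerolling algorithm)
 * of finding consecutively repeating instruction sequences. Instructions are
 * abstracted to strings, with operands that change per iteration normalized. */
#include <stdio.h>
#include <string.h>

/* Returns 1 if the k instructions starting at i equal the k that follow them. */
static int repeats(const char *insns[], int n, int i, int k) {
    if (i + 2 * k > n) return 0;
    for (int j = 0; j < k; j++)
        if (strcmp(insns[i + j], insns[i + k + j]) != 0) return 0;
    return 1;
}

int main(void) {
    /* Normalized form of the unrolled binary above ('#' stands for the
       per-iteration constant 0, 1, 2). */
    const char *insns[] = { "Ld reg2, 100(#)", "Add reg1, reg1, reg2",
                            "Ld reg2, 100(#)", "Add reg1, reg1, reg2",
                            "Ld reg2, 100(#)", "Add reg1, reg1, reg2" };
    int n = 6;
    for (int i = 0; i < n; i++)
        for (int k = 1; i + 2 * k <= n; k++)
            if (repeats(insns, n, i, k))
                printf("candidate unrolled body: start=%d, length=%d\n", i, k);
    return 0;
}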
31
Warp Processing: Decompilation
  • Study
  • Synthesis after decompilation often yields circuits quite
    similar to those synthesized from the original source
  • Almost identical performance, small area overhead

FPGA 2005
32
2. Deriving high-level constructs from binaries
  • Recent study of decompilation robustness
  • In the presence of compiler optimizations, and across
    instruction sets
  • Energy savings of 77%/76%/87% for
    MIPS/ARM/MicroBlaze

ICCAD'05, DATE'04
33
Decompilation is Effective Even with High
Compiler-Optimization Levels
Average Speedup of 10 Examples
Publication: New Decompilation Techniques for
Binary-level Co-processor Generation. G. Stitt,
F. Vahid. IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), Nov. 2005.
34
Decompilation Effectiveness: In-Depth Study
  • Performed in-depth study with Freescale
  • H.264 video decoder
  • Highly-optimized proprietary code, not reference
    code
  • Huge difference
  • Research question: Is synthesis from binaries
    competitive on highly-optimized code?
  • Several-month study

MPEG-2 vs. H.264: better quality, or smaller files,
using more computation
35
Optimized H.264
  • Larger than most benchmarks
  • H.264: 16,000 lines
  • Previous work: 100 to several thousand lines
  • Highly-optimized
  • H.264: many man-hours of manual optimization
  • 10x faster than the reference code used in previous
    works
  • Different profiling results
  • Previous examples
  • 90% of time in several loops
  • H.264
  • 90% of time in 45 functions
  • Harder to speed up

36
C vs. Binary Synthesis on Opt. H.264
  • Binary partitioning competitive with source
    partitioning
  • Speedups compared to ARM9 software
  • Binary: 2.48, C: 2.53
  • Decompilation recovered nearly all high-level
    information needed for partitioning and synthesis

37
Warp Processing: Synthesis
[Tool-flow sidebar: profiling & partitioning --> decompilation
--> synthesis --> JIT FPGA compilation --> FPGA binary; binary
updater --> updated microprocessor binary]
  • ROCM - Riverside On-Chip Minimizer
  • Standard register-transfer synthesis
  • Logic synthesis: make it lean
  • Combination of approaches from Espresso-II
    (Brayton et al., 1984; Hassoun & Sasao, 2002)
    and Presto (Svoboda & White, 1979)
  • Cost/benefit analysis of operations
  • Result
  • Single expand phase instead of multiple
    iterations
  • Eliminating the need to compute the off-set reduces
    memory usage
  • On average only 2% larger than the optimal solution

38
Warp Processing: JIT FPGA Compilation
  • Hard: Routing is extremely compute/memory
    intensive
  • Solution: Jointly design the CAD tools and the FPGA
    architecture
  • Cost/benefit analysis
  • Highly iterative process

39
Warp-Targeted FPGA Architecture
  • CAD-specialized configurable logic fabric
  • Simplified switch matrices
  • Directly connected to adjacent CLB
  • All nets are routed using only a single pair of
    channels
  • Allows for efficient routing
  • Routing is by far the most time-consuming on-chip
    CAD task
  • Simplified CLBs
  • Two 3-input, 2-output LUTs
  • Each CLB connected to adjacent CLB to simplify
    routing of carry chains
  • Currently being prototyped by Intel (scheduled
    for 2006 Q3 shuttle)

DATE'04
40
Warp Processing: Technology Mapping
  • ROCTM - Technology Mapping/Packing
  • Decompose hardware circuit into DAG
  • Nodes correspond to basic 2-input logic gates
    (AND, OR, XOR, etc.)
  • Hierarchical bottom-up graph clustering algorithm
  • Breadth-first traversal combining nodes to form
    single-output LUTs
  • Combine LUTs with common inputs to form final
    2-output LUTs (a packing check is sketched after this list)
  • Pack LUTs in which output from one LUT is input
    to second LUT
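A minimal sketch of the packing check described in the last two
bullets (assumed representation; not the ROCTM code): two
single-output LUTs can share one 3-input, 2-output physical LUT
if the union of their input signals has at most three members.

/* Minimal sketch (assumed representation, not the ROCTM code) of the packing
 * check: two single-output LUTs can share one 2-output, 3-input physical LUT
 * if the union of their input signals has at most 3 distinct members. */
#include <stdio.h>

#define MAX_IN 3

typedef struct {
    int inputs[MAX_IN];   /* signal ids feeding this logical LUT */
    int num_inputs;
} Lut;

/* Count the distinct signals in the union of a's and b's inputs. */
static int union_size(const Lut *a, const Lut *b) {
    int seen[2 * MAX_IN], n = 0;
    for (int i = 0; i < a->num_inputs; i++) seen[n++] = a->inputs[i];
    for (int i = 0; i < b->num_inputs; i++) {
        int dup = 0;
        for (int j = 0; j < n; j++) if (seen[j] == b->inputs[i]) dup = 1;
        if (!dup) seen[n++] = b->inputs[i];
    }
    return n;
}

static int can_pack(const Lut *a, const Lut *b) {
    return union_size(a, b) <= MAX_IN;
}

int main(void) {
    Lut a = { {1, 2, 3}, 3 };     /* f(s1, s2, s3)                     */
    Lut b = { {2, 3}, 2 };        /* g(s2, s3): shares inputs with a   */
    Lut c = { {4, 5, 6}, 3 };     /* h(s4, s5, s6): no shared inputs   */
    printf("pack a,b: %s\n", can_pack(&a, &b) ? "yes" : "no");  /* yes */
    printf("pack a,c: %s\n", can_pack(&a, &c) ? "yes" : "no");  /* no  */
    return 0;
}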

[JIT FPGA compilation flow: logic synthesis --> technology
mapping/packing --> placement --> routing]
Dynamic Hardware/Software Partitioning: A First
Approach (DAC'03); A Configurable Logic Fabric for
Dynamic Hardware/Software Partitioning (DATE'04)
41
Warp Processing: Placement
  • ROCPLACE - Placement
  • Dependency-based positional placement algorithm
  • Identify the critical path, placing critical nodes in the
    center of the CLF (configurable logic fabric)
  • Use dependencies between remaining CLBs to
    determine placement
  • Attempt to use adjacent CLB routing whenever
    possible

Dynamic Hardware/Software Partitioning: A First
Approach (DAC'03); A Configurable Logic Fabric for
Dynamic Hardware/Software Partitioning (DATE'04)
42
Warp Processing: Routing
  • ROCR - Riverside On-Chip Router
  • Requires much less memory than VPR, as the resource
    graph is smaller
  • 10x faster execution time than VPR (timing
    driven)
  • Produces circuits with a critical path 10% shorter
    than VPR (routability
    driven)

Dynamic FPGA Routing for Just-in-Time FPGA
Compilation (DAC'04)
43
Outline
  • FPGAs
  • Overview
  • Hard to program --> Binary-level partitioning
  • Warp processing
  • Techniques underlying warp processing
  • Overall warp processing results
  • Directions and Summary

44
Experiments with Warp Processing
  • Warp Processor
  • ARM/MIPS plus our fabric
  • Riverside on-chip CAD tools to map critical
    region to configurable fabric
  • Requires less than 2 seconds on a lean embedded
    processor to perform synthesis and JIT FPGA
    compilation
  • Traditional HW/SW Partitioning
  • ARM/MIPS plus Xilinx Virtex-E FPGA
  • Manually partitioned software using VHDL
  • VHDL synthesized using Xilinx ISE 4.1

45
Warp Processors: Performance Speedup (Most
Frequent Kernel Only)
[Chart: kernel speedups relative to SW-only execution]
46
Warp Processors: Performance Speedup (Overall,
Multiple Kernels)
Assuming a 100 MHz ARM, and the fabric clocked at the rate
determined by synthesis
  • Energy reduction of 38% - 94%

[Chart: overall speedups relative to SW-only execution]
47
Warp Processors - Results: Execution Time and
Memory Requirements
48
Outline
  • FPGAs
  • Overview
  • Hard to program --> Binary-level partitioning
  • Warp processing
  • Techniques underlying warp processing
  • Overall warp processing results
  • Directions and Summary

49
Direction: Coding Guidelines for Partitioning?
  • The in-depth H.264 study led to a question: Why aren't
    speedups (from binary or C) closer to ideal
    (zero time per function)?
  • We thus examined dozens of benchmarks in more
    detail
  • Are there simple coding guidelines that result in
    better speedups when kernels are synthesized to
    circuits?

50
Synthesis-Oriented Coding Guidelines
  • Pass by value-return (sketched at the end of this slide)
  • Declare a local array and copy in all data needed
    by a function (makes the lack of aliases explicit)
  • Function specialization
  • Create a function version having frequent
    parameter values as constants

Original:
  void f(int width, int height) {
    . . .
    for (i=0; i < width; i++)
      for (j=0; j < height; j++)
        . . .
  }

Rewritten:
  void f_4_4() {
    . . .
    for (i=0; i < 4; i++)
      for (j=0; j < 4; j++)
        . . .
  }
Bounds are explicit so loops are now unrollable
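The first guideline above, pass by value-return, has no code on
this slide; a minimal sketch (hypothetical function g and array
size, not from the presentation) might look like this:

/* Minimal sketch (hypothetical function g, not from the slides) of the
 * pass-by-value-return guideline: copy the data a function needs into a local
 * array, work only on the copy, then copy the results back out. The local copy
 * makes the lack of pointer aliases explicit to the synthesis tool. */
#include <stdio.h>

#define N 16

void g(int *data) {
    int local[N];
    int i;

    for (i = 0; i < N; i++)      /* copy in ("pass by value") */
        local[i] = data[i];

    for (i = 0; i < N; i++)      /* kernel works only on the alias-free copy */
        local[i] = local[i] * 3 + 1;

    for (i = 0; i < N; i++)      /* copy out ("return") */
        data[i] = local[i];
}

int main(void) {
    int data[N];
    for (int i = 0; i < N; i++) data[i] = i;
    g(data);
    printf("data[5] = %d\n", data[5]);   /* prints 16 */
    return 0;
}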
51
Synthesis-Oriented Coding Guidelines
  • Algorithmic specialization
  • Use parallelizable hardware algorithms when
    possible
  • Hoisting and sinking of error checking
  • Keep error checking out of loops to enable
    unrolling
  • Lookup table avoidance
  • Use expressions rather than lookup tables

Original:
  int clip[512] = { . . . };
  void f() {
    . . .
    for (i=0; i < 10; i++)
      val[i] = clip[val[i]];
    . . .
  }

Rewritten (comparisons can now be parallelized):
  void f() {
    . . .
    for (i=0; i < 10; i++) {
      if (val[i] > 255) val[i] = 255;
      else if (val[i] < 0) val[i] = 0;
    }
    . . .
  }
52
Synthesis-Oriented Coding Guidelines
  • Use explicit control flow
  • Replace function pointers with if statements and
    static function calls

Original:
  void (*funcArray[])(char *data) = { func1, func2, . . . };
  void f(char *data) {
    . . .
    funcPointer = funcArray[i];
    (*funcPointer)(data);
    . . .
  }

Rewritten:
  void f(char *data) {
    . . .
    if (i == 0) func1(data);
    else if (i == 1) func2(data);
    . . .
  }
53
Coding Guideline Results on H.264
  • Simple coding guidelines made a large improvement
  • The rewritten software is only 3% slower than the original
  • And, binary partitioning is still competitive with C
    partitioning
  • Speedups: binary 6.55, C 6.56
  • The small difference is caused by switch statements that
    used indirect jumps

54
Coding Guideline Results on Other Benchmarks
  • Studied guidelines further on standard benchmarks
  • Further synthesis speedups (again, independent of
    C vs. binary issue)
  • More guidelines to be developed
  • As compute platforms incorporate FPGAs, might
    these guidelines become mainstream?

55
Direction: New Applications -- Image Processing
  • 32x average speedup compared to a µP with a 10x
    faster clock
  • Exploits parallelism in image processing
  • Window operations contain much fine-grained
    parallelism (a 3x3 window sketch follows this list)
  • And, each output pixel can be determined in parallel
  • Performance is memory-bandwidth limited
  • Warp processing can output a pixel per cycle for
    each pixel that can be fetched from memory per
    cycle
  • Faster memory will further improve performance
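A minimal sketch of a window operation (a hypothetical 3x3
averaging filter; not a specific benchmark): each output pixel
depends only on a small input neighborhood, so the nine reads and
the summation can run in parallel in hardware, and many output
pixels can be computed at once.

/* Minimal sketch (hypothetical 3x3 averaging window, not a specific benchmark)
 * of an image window operation: every output pixel uses only a 3x3 input
 * neighborhood, so the adds of one window -- and many windows at once -- can
 * run in parallel, limited mainly by memory bandwidth. */
#include <stdio.h>

#define W 16
#define H 16

void window3x3(unsigned char in[H][W], unsigned char out[H][W]) {
    for (int y = 1; y < H - 1; y++) {
        for (int x = 1; x < W - 1; x++) {
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)        /* 9 independent reads ...      */
                for (int dx = -1; dx <= 1; dx++)
                    sum += in[y + dy][x + dx];
            out[y][x] = (unsigned char)(sum / 9);   /* ... reduced by an adder tree */
        }
    }
}

int main(void) {
    static unsigned char in[H][W], out[H][W];
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            in[y][x] = (unsigned char)(x + y);
    window3x3(in, out);
    printf("out[2][2] = %d\n", out[2][2]);   /* prints 4 */
    return 0;
}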

56
Direction: Applications with Process-Level
Parallelism
  • Parallel code provides further speedup
  • Average 79x speedup compared to a desktop µP
  • Use FPGA to implement 10s or 100s of processors
  • Can also exploit instruction-level parallelism
  • Warp tools will have to detect coarse-grained
    parallelism

57
Summary
  • Showed feasibility of warp technology
  • Application kernels can be dynamically mapped to an
    FPGA using a reasonable amount of on-chip compute
    resources
  • Tremendous potential applicability
  • Presently investigating
  • Embedded (w/ Freescale)
  • Desktop (w/ Intel)
  • Server (w/ IBM)
  • Radically-new FPGA apps may be possible
  • Neural networks that rewire themselves? Network
    routers whose queuing structure changes based on
    traffic patterns?
  • If the technology exists to synthesize circuits
    dynamically, what can we do with that technology?