Title: Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
1. Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis
- Greg Stitt
- Department of Electrical and Computer Engineering
- University of Florida
2. Introduction
- Improved performance enables new applications
  - Past decade: MP3 players, portable game consoles, cell phones, etc.
  - Future architectures: speech/image recognition, self-guiding cars, computational biology, etc.
3. Introduction
- FPGAs (Field Programmable Gate Arrays) implement custom circuits
  - 10x, 100x, even 1000x speedups for scientific and embedded apps
  - [Najjar 04] [He, Lu, Sun 05] [Levine, Schmit 03] [Prasanna 06] [Stitt, Vahid 05], ...
- But FPGAs are not mainstream
  - Warp processing goal: bring FPGAs into the mainstream
  - Make FPGAs "invisible"
[Figure: performance of uP vs. FPGA. FPGAs are capable of large performance improvements.]
4. Introduction: Hardware/Software Partitioning
C code for FIR filter:
    for (i = 0; i < 16; i++) y[i] += c[i] * x[i];
    ...
    for (i = 0; i < 128; i++) y[i] += c[i] * x[i];
    ...
5. Introduction: High-level Synthesis
- Problem: describing a circuit in an HDL is time consuming and difficult
- Solution: high-level synthesis
  - Create the circuit from high-level code
  - [Gupta, DeMicheli 92] [Camposano, Wolf 91] [Rabaey 96] [Gajski, Dutt 92]
  - Allows developers to use a higher-level specification
  - Potentially enables synthesis for software developers
    for (i = 0; i < 16; i++) y[i] += c[i] * x[i];
8. Outline
- Introduction
- Warp Processing Overview
- Enabling Technology: Binary Synthesis
  - Key techniques for synthesis from binaries
  - Decompilation
- Current and Future Directions
  - Multi-threaded Warp Processing
  - Custom Communication
9. Problems with High-Level Synthesis
- Problem: high-level synthesis is unattractive to software developers
  - Requires a specialized language
    - SystemC, NapaC, HandelC, ...
  - Requires a specialized compiler
    - Spark, ROCCC, CatapultC, ...
- Limited commercial success
  - Software developers are reluctant to change tools
[Figure: microprocessor (uP) and FPGA]
10. Warp Processing: Invisible Synthesis
- Solution: make synthesis invisible
- 2 requirements
  - Standard software tool flow
    - Perform compilation before synthesis
  - Hide the synthesis tool
    - Move synthesis on chip
- Similar to dynamic binary translation [Transmeta]
  - But translate to hardware
[Figure: uP and FPGA. A warp processor looks like a standard uP but invisibly synthesizes hardware.]
12. Warp Processing: Invisible Synthesis
- Advantages
  - Supports all languages, compilers, IDEs
  - Supports synthesis of assembly code
  - Supports synthesis of library code
  - Also enables dynamic optimizations
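13. Warp Processing Background: Basic Idea
Step 1: Initially, the software binary is loaded into instruction memory.
[Figure: warp processor architecture with profiler, instruction memory (I Mem), µP, data cache (D$), FPGA, and on-chip CAD]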
14. Warp Processing Background: Basic Idea
Step 2: The microprocessor executes the instructions in the software binary.
15. Warp Processing Background: Basic Idea
Step 3: The profiler monitors instructions and detects critical regions in the binary.
[Figure: profiler observing a repeating stream of add and beq instructions executed by the µP]
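The slide does not say how the profiler picks a critical region. A minimal sketch, assuming the profiler counts taken backward branches (each one treated as a loop iteration) and reports the most frequent branch target as the critical loop. The table size, struct layout, and function names are hypothetical and are not taken from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16 /* hypothetical size of the on-chip counter table */

/* One counter per loop head (backward-branch target) seen so far. */
struct loop_profile {
    uint32_t target[PROFILE_SLOTS];
    uint32_t count[PROFILE_SLOTS];
    size_t used;
};

/* Record one taken branch. Only backward branches are counted,
 * because a backward branch normally closes a loop iteration. */
void profile_branch(struct loop_profile *p, uint32_t pc, uint32_t target)
{
    assert(p != NULL);
    if (target >= pc)
        return; /* forward branch: not a loop back edge */
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILE_SLOTS) { /* table full: new loops are ignored */
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Address of the most frequently executed loop head, or 0 if none was seen. */
uint32_t hottest_loop(const struct loop_profile *p)
{
    assert(p != NULL);
    uint32_t best_target = 0;
    uint32_t best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best_target = p->target[i];
        }
    }
    return best_target;
}
```

In this sketch the profiler only sees branch addresses, so it needs no knowledge of the source program, which fits the requirement that the standard software tool flow stays unchanged.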
16. Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in the critical region.
17. Warp Processing Background: Basic Idea
Step 5: On-chip CAD converts the critical region into a control/data flow graph (CDFG).
[Figure: the on-chip CAD block is labeled Dynamic Partitioning Module (DPM)]
18. Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit.
19. Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps the circuit onto the FPGA.
20. Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more.
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
21. Expandable Logic
[Figure: uP with RAM, expandable RAM, and expandable logic; performance chart]
22. Expandable Logic
- Allows for customization of platforms
  - User can select FPGAs based on the applications used
[Figure: performance per application; example: portable gaming]
24. Expandable Logic
- Allows for customization of platforms
  - User can select FPGAs based on the applications used
- Platform designer doesn't have to decide on a fixed amount of FPGA
- User doesn't have to pay for FPGA that isn't needed
[Figure: performance per application; example: web browser on a no-FPGA configuration]
25. Warp Processing Background: Basic Technology
- Challenge: CAD tools normally require powerful workstations
- Develop extremely efficient on-chip CAD tools
  - Requires efficient synthesis
  - Requires specialized FPGA and physical design tools (JIT FPGA compilation)
    - [Lysecky FCCM05/DAC04], University of Arizona
26. Warp Processing Background: On-Chip CAD
[Figure: on-chip CAD flow with stages RT synthesis, logic optimization, technology mapping, placement, and routing; chart labels: Xilinx ISE, 9.1 s]
27. Warp Processing: Initial Results
- Embedded applications
  - Average speedup of 6.3x
  - Achieved completely transparently
  - Also, energy savings of 66%
29. Binary Synthesis
- Warp processors perform synthesis from the software binary: binary synthesis
- Problem: no high-level information
  - Synthesis needs high-level constructs
  - > 10x slowdown
- Can we recover high-level information for synthesis?
  - Make binary synthesis (and warp processing) competitive with high-level synthesis
High-level code:
    for (i = 0; i < 128; i++) y[i] += c[i] * x[i];
    ...
Binary (no high-level constructs: arrays, loops, etc.):
    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5
Hardware can be > 10x to 100x
30. Decompilation
- We realized decompilation recovers high-level information
  - But it is generally used for binary translation or source-code recovery
  - May not be suitable for synthesis
- We studied existing approaches
  - [Cifuentes 94, 99, 01] [Mycroft 99, 01]
  - DisC, dcc, Boomerang, Mocha, SourceAgain
- Determined relevant techniques
- Adapted existing techniques for synthesis
31. Decompilation: Control/Data Flow Graph Recovery
- Recovery of control/data flow graph (CDFG)
  - Format used by synthesis
- Difficult because of indirect jumps
  - Cannot statically analyze control flow
  - But heuristics are over 99% successful on standard benchmarks [Cifuentes 99, 00]
Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i = 0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Corresponding assembly:
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
After control/data flow graph creation:
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
32. Decompilation: Data Flow Analysis
- Original purpose: remove temporary registers
- Area overhead: 130%
- Need new techniques for binary synthesis
After data flow analysis:
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
33. Decompilation: Data Flow Analysis
- Strength reduction: compare-with-zero instructions
    Sub reg3, reg4, reg5
    Bz reg3, -5
- Operator size reduction
    Lb reg4, 0(reg1)
    Mvi reg5, 16
    Add reg3, reg4, reg5
[Figure: datapath operators annotated with 32-bit reg3, reg4, and reg5]
Area overhead reduced to 10%
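The operator size reduction example can be read as a bit-width calculation. A minimal sketch, assuming `Lb` loads an unsigned 8-bit value and that widths are inferred by simple forward propagation; the function names are hypothetical and the slide does not state the actual algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to hold an unsigned constant (at least 1). */
unsigned bits_for_const(uint32_t value)
{
    unsigned bits = 1;
    while (value >>= 1)
        bits++;
    return bits;
}

/* Width of an unsigned adder result: one carry bit beyond the wider
 * operand, capped at the 32-bit register width. */
unsigned add_width(unsigned a_bits, unsigned b_bits)
{
    assert(a_bits >= 1 && b_bits >= 1);
    unsigned wider = a_bits > b_bits ? a_bits : b_bits;
    return wider + 1 > 32 ? 32 : wider + 1;
}
```

Under these assumptions, `Add reg3, reg4, reg5` with an 8-bit `reg4` and the constant 16 in `reg5` needs `add_width(8, bits_for_const(16))` bits, which is 9 rather than 32.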
34. Decompilation: Function Recovery
- Recover parameters and return values
  - Def-use analysis of prologue/epilogue
  - 100% success rate
After function recovery:
    long f( long reg2 ) {
      int reg3 = 0;
      int reg4 = 0;
      loop:
      reg4 = reg4 + mem[reg2 + (reg3 << 1)];
      reg3 = reg3 + 1;
      if (reg3 < 10) goto loop;
      return reg4;
    }
35. Decompilation: Control Structure Recovery
- Recover loops, if statements
  - Uses interval analysis techniques [Cifuentes 94]
  - 100% success rate
After control structure recovery:
    long f( long reg2 ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += mem[reg2 + (reg3 << 1)];
      }
      return reg4;
    }
36. Decompilation: Array Recovery
- Detect linear memory patterns and row-major ordering calculations
  - 95% success rate
  - [Stitt, Guo, Najjar, Vahid 05]
  - [Cifuentes 00]
After array recovery:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
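One way to read "linear memory patterns" in the running example: the load address is `reg2 + (reg3 << 1)`, and `reg3` is the loop counter, so consecutive iterations touch addresses 2 bytes apart, which suggests an array of 2-byte elements. A minimal sketch under that assumption; the struct and function below are hypothetical and only cover this one address shape.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of a load address of the form base + (index << shift). */
struct mem_access {
    int index_is_loop_counter; /* nonzero if index steps by 1 per iteration */
    unsigned shift;            /* log2 of the stride in bytes */
};

/* Element size in bytes implied by a linear access pattern,
 * or 0 if the access is not recognized as an array access. */
size_t recovered_element_size(const struct mem_access *m)
{
    assert(m != NULL);
    if (!m->index_is_loop_counter || m->shift > 3)
        return 0;
    return (size_t)1 << m->shift;
}
```

For the example, a shift of 1 gives an element size of 2 bytes, consistent with the recovered `short array[10]`.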
37. Comparison of Decompiled Code and Original Code
- Decompiled code is almost identical to the original code
  - Only difference is variable names
- Binary synthesis is competitive with high-level synthesis
Original C code:
    long f( short a[10] ) {
      long accum = 0;
      for (int i = 0; i < 10; i++) {
        accum += a[i];
      }
      return accum;
    }
Decompiled code:
    long f( short array[10] ) {
      long reg4 = 0;
      for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
      }
      return reg4;
    }
Almost identical representations
38. Binary Synthesis Tool Flow
[Figure: binary synthesis tool flow with libraries/object code as inputs, targeting uP and FPGA; label: 30,000 lines of C code]
39. Binary Synthesis is Competitive with High-Level Synthesis
- Binary synthesis is competitive with high-level synthesis
  - Binary speedup: 8x; high-level speedup: 8.2x
  - High-level synthesis only 2.5% better
- Commercial products beginning to appear
  - Critical Blue, Binachip
[Figure: speedup comparison; small difference in speedup]
40. Binary Synthesis with Software Compiler Optimizations
- But the binaries were generated with few optimizations
  - Optimizations for software may hurt hardware
  - Need new decompilation techniques
- Hardware synthesized from an optimized binary may be inefficient
[Figure: C code -> SW compiler -> optimized binary -> binary synthesis -> uP and FPGA]
41. Loop Rerolling
- Problem: loop unrolling may cause inefficient hardware
  - Longer synthesis times
    - Super-linear heuristics
    - Unrolling 100 times => synthesis time is 100^2 times longer
  - Larger area requirements
  - Unrolling by the compiler is unlikely to match unrolling by synthesis
  - Loop structure is needed for advanced synthesis techniques
- Solution: we introduce loop rerolling to undo loop unrolling
Unrolled loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
42. Loop Rerolling: Identifying Unrolled Loops
- Idea: identify consecutively repeating instruction sequences
- Find consecutive repeating substrings: adjacent nodes with the same substring
Original C code:
    x = x + 1;
    for (i = 0; i < 2; i++)
      a[i] = b[i] + 1;
    y = x;
43. Loop Rerolling
Unrolled loop identification:
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
Average speedup of 1.6x
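The identification step can be illustrated with a deliberately naive search. A minimal sketch, assuming instructions are compared by opcode only (operands such as `b(0)` and `b(1)` are expected to differ between copies) and that a candidate window has already been chosen. The slides describe a substring-based method; this brute-force period check is a simpler stand-in, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Opcodes used in the slide's example window. */
enum opcode { OP_LD, OP_ADD, OP_ST };

/* Length of the shortest body that, repeated back to back, reproduces
 * the opcode window ops[0..len). Returns len when nothing repeats. */
size_t shortest_repeating_body(const int *ops, size_t len)
{
    assert(ops != NULL || len == 0);
    for (size_t p = 1; p <= len / 2; p++) {
        if (len % p != 0)
            continue; /* the body must tile the window exactly */
        size_t i = p;
        while (i < len && ops[i] == ops[i - p])
            i++;
        if (i == len)
            return p;
    }
    return len;
}
```

For the window `Ld, Add, St, Ld, Add, St` this returns 3, meaning a three-instruction body unrolled twice; the surrounding `Add r3, r3, 1` and `Mov r4, r3` fall outside the window.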
44. Strength Promotion
- Problem: strength reduction may cause inefficient hardware
- However, some of the strength reduction was beneficial
- Strength promotion lets synthesis, not the software compiler, decide on strength reduction
- Average speedup of 1.5x
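The slide does not spell out the transformation. A minimal sketch, assuming the software compiler replaced a multiplication by a constant with a sum of shifted copies of the operand, for example `(x << 3) + (x << 1)` in place of `x * 10`, and that promotion recovers the constant so the synthesis tool can choose its own implementation. The function name and the representation of the expression as a list of shift amounts are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given the shift amounts of a strength-reduced expression
 * (x << s0) + (x << s1) + ..., return the constant c such that
 * the expression equals x * c (modulo 2^32). */
uint32_t promote_to_multiplier(const unsigned *shifts, size_t n)
{
    assert(shifts != NULL && n > 0);
    uint32_t constant = 0;
    for (size_t i = 0; i < n; i++) {
        assert(shifts[i] < 32);
        constant += (uint32_t)1 << shifts[i];
    }
    return constant;
}
```

With shifts 3 and 1 the recovered constant is 10, so the synthesis tool sees a single multiply and can decide whether a multiplier or a shift-and-add network is the better circuit.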
45. Multiple ISA/Optimization Results
- What about aggressive software compiler optimizations?
  - May obscure the binary, making decompilation impossible
- What about different instruction sets?
  - Side effects may degrade hardware performance
[Figure: speedup results]
46. High-level vs. Binary Synthesis: Proprietary H.264 Decoder
- High-level synthesis vs. binary synthesis
  - Collaboration with Freescale Semiconductor
- H.264 decoder
  - MPEG-4 Part 10 Advanced Video Coding (AVC)
  - 3x smaller than MPEG-2
  - Better quality
[Figure: H.264 vs. MPEG-2]
47. High-level vs. Binary Synthesis: Proprietary H.264 Decoder
- Binary synthesis was competitive with high-level synthesis
  - High-level speedup: 6.56x
  - Binary speedup: 6.55x
49. Thread Warping: Overview
- Architectural trend: include more cores on chip
- Result: more multi-threaded applications
- OS schedules 4 threads to custom accelerators
- Warp tools create custom accelerators for b( )
- Remaining 8 threads placed in thread queue
- 3x more thread parallelism
[Figure: four µPs, OS, warp tools, profiler, and warp FPGA]
51. Thread Warping: Results
- Thread warping is 120x faster than a 4-uP (ARM) system
- Comparison of thread warping (TW) and multi-core
  - Simulated multi-cores ranging from 4 to 64
  - Thread warping: 4 cores + FPGA
52. Warp Processing: Custom Communication
- NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli] [Hemani] [Kumar]
- Problem: the best topology is application dependent
[Figure: performance of bus vs. mesh topologies between µPs for App1 and App2]
53. Warp Processing: Custom Communication
- Warp processing can dynamically choose the topology
  - 2x to 100x improvement
- Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: "Amoebic Computing"
[Figure: FPGA-based communication; performance of bus vs. mesh topologies for App1 and App2]
54. Summary
55. References
- Patent
  - Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, G. Stitt. Patent Pending, 2004.
- Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
- Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
- Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
- Expandable Logic. G. Stitt, F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
- New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt, F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
- Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
- A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
- Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.
Supported by NSF, SRC, Intel, IBM, Xilinx