Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis presentation

1
Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis

Greg Stitt
Department of Electrical and Computer Engineering
University of Florida

2
Introduction

Improved performance enables new applications
Past decade: MP3 players, portable game consoles, cell phones, etc.
Future architectures: Speech/image recognition, self-guiding cars, computational biology, etc.

3
Introduction

FPGAs (Field Programmable Gate Arrays)
Implement custom circuits
Speedups of 10x, 100x, even 1000x for scientific and embedded apps
[Najjar 04] [He, Lu, Sun 05] [Levine, Schmit 03] [Prasanna 06] [Stitt, Vahid 05], ...
But, FPGAs not mainstream
Warp Processing Goal: Bring FPGAs into mainstream
Make FPGAs "invisible"

[Figure: performance of uP vs. FPGA. FPGAs capable of large performance improvements]
4
Introduction: Hardware/Software Partitioning
C Code for FIR Filter
for (i=0; i < 16; i++) y[i] += c[i] * x[i];
...
for (i=0; i < 128; i++) y[i] += c[i] * x[i];
...

[Figure label: 1000 cycles]

5
Introduction: High-level Synthesis

Problem: Describing a circuit using an HDL is time-consuming and difficult
Solution: High-level synthesis
Create circuit from high-level code
[Gupta, DeMicheli 92] [Camposano, Wolf 91] [Rabaey 96] [Gajski, Dutt 92]
Allows developers to use higher-level specification
Potentially enables synthesis for software developers

7
Introduction: High-level Synthesis

Example high-level code:
for (i=0; i < 16; i++) y[i] += c[i] * x[i];
8
Outline

Introduction
Warp Processing Overview
Enabling Technology: Binary Synthesis
Key techniques for synthesis from binaries
Decompilation
Current and Future Directions
Multi-threaded Warp Processing
Custom Communication

9
Problems with High-Level Synthesis

Problem: High-level synthesis is unattractive to software developers
Requires specialized language
SystemC, NapaC, HandelC, ...
Requires specialized compiler
Spark, ROCCC, CatapultC, ...
Limited commercial success
Software developers reluctant to change tools
10
Warp Processing: Invisible Synthesis

Solution: Make synthesis invisible
2 Requirements
Standard software tool flow
Perform compilation before synthesis
Hide synthesis tool
Move synthesis on chip
Similar to dynamic binary translation [Transmeta]
But translate to hardware

11
Warp Processing: Invisible Synthesis

Warp processor looks like a standard uP but invisibly synthesizes hardware
[Figure: uP and FPGA]
12
Warp Processing: Invisible Synthesis

Advantages
Supports all languages, compilers, IDEs
Supports synthesis of assembly code
Supports synthesis of library code
Also enables dynamic optimizations

13
Warp Processing Background: Basic Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
14
Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
15
Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: Profiler watching a repeating stream of beq and add instructions executed by the µP; also shows I Mem, D, FPGA, On-chip CAD]
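
The profiling step can be illustrated with a small sketch. It assumes, purely for illustration, that a critical region is the loop whose backward branch is taken most often; the slides do not say how the profiler picks regions. The `BranchEvent` type, the `MAX_LOOPS` table size, and the `hottest_loop` function are hypothetical names, not part of the warp processor design.

```c
#include <stddef.h>

/* One taken branch as a hypothetical profiler might see it:
   the address of the branch and the address it jumped to. */
typedef struct { unsigned pc; unsigned target; } BranchEvent;

#define MAX_LOOPS 16  /* assumed size of the profiler's small table */

/* Returns the target address of the most frequently taken backward
   branch, or 0 if no backward branch was observed. */
unsigned hottest_loop(const BranchEvent *trace, size_t n) {
    unsigned targets[MAX_LOOPS];
    unsigned counts[MAX_LOOPS];
    size_t used = 0;

    for (size_t i = 0; i < n; i++) {
        if (trace[i].target >= trace[i].pc) continue; /* forward branch: not a loop */
        size_t j = 0;
        while (j < used && targets[j] != trace[i].target) j++;
        if (j == used) {
            if (used == MAX_LOOPS) continue;          /* table full: ignore new loops */
            targets[used] = trace[i].target;
            counts[used] = 0;
            used++;
        }
        counts[j]++;
    }

    unsigned best = 0, best_count = 0;
    for (size_t j = 0; j < used; j++) {
        if (counts[j] > best_count) {
            best_count = counts[j];
            best = targets[j];
        }
    }
    return best;
}
```

A fixed-size table is used here because an on-chip profiler would have very limited storage; a real design would also need a replacement policy for cold entries.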
16
Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD]
17
Warp Processing Background: Basic Idea
Step 5: On-chip CAD converts critical region into control data flow graph (CDFG)
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD (Dynamic Part. Module, DPM)]
18
Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD (Dynamic Part. Module, DPM)]
19
Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD (Dynamic Part. Module, DPM)]

20
Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4
[Figure: warp processor with Profiler, I Mem, µP, D, FPGA, On-chip CAD (Dynamic Part. Module, DPM)]

21
Expandable Logic
[Figure: uP with RAM, expandable RAM, and expandable logic; performance plot]
22
Expandable Logic

Allows for customization of platforms
User can select FPGAs based on the applications used

[Figure: performance by application; example application: Portable Gaming]
24
Expandable Logic

[Figure: performance by application; example application: Web Browser, No-FPGA configuration]

Platform designer doesn't have to decide on a fixed amount of FPGA.
User doesn't have to pay for FPGA that isn't needed.

25
Warp Processing Background: Basic Technology

Challenge: CAD tools normally require powerful workstations
Develop extremely efficient on-chip CAD tools
Requires efficient synthesis
Requires specialized FPGA, physical design tools (JIT FPGA compilation)
[Lysecky FCCM'05/DAC'04], University of Arizona
26
Warp Processing Background: On-Chip CAD
[Figure: CAD flow; labels: Synthesis, RT Syn., Log. Opt., Tech. Map, Place, Route; timing label: Xilinx ISE, 9.1 s]
27
Warp Processing: Initial Results - Embedded Applications

Average speedup of 6.3x
Achieved completely transparently
Also, energy savings of 66%

28
Outline

Introduction
Warp Processing Overview
Enabling Technology: Binary Synthesis
Key techniques for synthesis from binaries
Decompilation
Current and Future Directions
Multi-threaded Warp Processing
Custom Communication

29
Binary Synthesis
for (i=0; i < 128; i++) y[i] += c[i] * x[i];
...

Warp processors perform synthesis from software binary: "binary synthesis"
Problem: No high-level information
Synthesis needs high-level constructs
> 10x slowdown
Can we recover high-level information for synthesis?
Make binary synthesis (and Warp processing) competitive with high-level synthesis

Addi r1, r0, 0
Ld r3, 256(r1)
Ld r4, 512(r1)
Subi r2, r1, 128
Jnz r2, -5
No high-level constructs: arrays, loops, etc.
Hardware can be > 10x to 100x
30
Decompilation

We realized decompilation recovers high-level information
But, generally used for binary translation or source-code recovery
May not be suitable for synthesis
We studied existing approaches
[Cifuentes 94, 99, 01] [Mycroft 99, 01]
DisC, dcc, Boomerang, Mocha, SourceAgain
Determined relevant techniques
Adapted existing techniques for synthesis

31
Decompilation: Control/Data Flow Graph Recovery

Recovery of control/data flow graph (CDFG)
Format used by synthesis
Difficult because of indirect jumps
Cannot statically analyze control flow
But, heuristics are over 99% successful on standard benchmarks
[Cifuentes 99, 00]

Original C Code
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}

Corresponding Assembly
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4

Control/Data Flow Graph Creation
reg3 := 0
reg4 := 0
loop:
reg1 := reg3 << 1
reg5 := reg2 + reg1
reg6 := mem[reg5 + 0]
reg4 := reg4 + reg6
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
32
Decompilation: Data Flow Analysis

Original purpose: remove temporary registers
Area overhead: 130%
Need new techniques for binary synthesis

After Data Flow Analysis (original C code and assembly are the same as on the CDFG recovery slide)
reg3 := 0
reg4 := 0
loop:
reg4 := reg4 + mem[reg2 + (reg3 << 1)]
reg3 := reg3 + 1
if (reg3 < 10) goto loop
ret reg4
33
Decompilation: Data Flow Analysis

Strength reduction of compare-with-zero instructions
Sub reg3, reg4, reg5
Bz reg3, -5

Operator Size Reduction
Lb reg4, 0(reg1)
Mvi reg5, 16
Add reg3, reg4, reg5
[Figure: 32-bit reg4 and 32-bit reg5 feeding a 32-bit operator that produces 32-bit reg3]

Area overhead reduced to 10%
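
Operator size reduction can be illustrated with a minimal bit-width sketch. It assumes that `Lb` is a sign-extending 8-bit load and that widths are counted in two's complement; the helper names `signed_const_width` and `add_width` are hypothetical and are not taken from the tool described in the talk.

```c
/* Bits needed to hold a signed constant in two's complement. */
int signed_const_width(long v) {
    int bits = 1;                 /* sign bit */
    if (v < 0) v = -(v + 1);      /* -1 needs 1 bit, -2 needs 2 bits, ... */
    while (v > 0) {
        bits++;
        v >>= 1;
    }
    return bits;
}

/* Width of a signed adder with inputs of wa and wb bits: one extra bit
   absorbs the carry. Capped at the 32-bit register size. */
int add_width(int wa, int wb) {
    int w = (wa > wb ? wa : wb) + 1;
    return w > 32 ? 32 : w;
}
```

Under these assumptions, the slide's `Add reg3, reg4, reg5` combines an 8-bit loaded byte with the 6-bit constant 16, so `add_width(8, signed_const_width(16))` gives a 9-bit adder in place of the 32-bit one implied by the register size.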
34
Decompilation: Function Recovery

Recover parameters and return values
Def-use analysis of prologue/epilogue
100% success rate

After Function Recovery (original C code and assembly are the same as on the CDFG recovery slide)
long f( long reg2 ) {
  int reg3 = 0;
  int reg4 = 0;
loop:
  reg4 = reg4 + mem[reg2 + (reg3 << 1)];
  reg3 = reg3 + 1;
  if (reg3 < 10) goto loop;
  return reg4;
}
35
Decompilation: Control Structure Recovery

Recover loops, if statements
Uses interval analysis techniques
[Cifuentes 94]
100% success rate

After Control Structure Recovery (original C code and assembly are the same as on the CDFG recovery slide)
long f( long reg2 ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += mem[reg2 + (reg3 << 1)];
  }
  return reg4;
}
36
Decompilation: Array Recovery

Detect linear memory patterns and row-major ordering calculations
95% success rate
[Stitt, Guo, Najjar, Vahid 05]
[Cifuentes 00]

After Array Recovery (original C code and assembly are the same as on the CDFG recovery slide)
long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += array[reg3];
  }
  return reg4;
}
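
The equivalence that array recovery relies on can be checked with a short sketch. It assumes a platform where `short` is 2 bytes, so the byte offset `reg3 << 1` is exactly the stride of a `short` array; the function names `sum_by_address` and `sum_by_index` are illustrative only.

```c
#include <string.h>

/* Decompiled form before array recovery: byte address = base + (reg3 << 1).
   Assumes sizeof(short) == 2. */
long sum_by_address(const unsigned char *mem, long base) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        short value;
        memcpy(&value, mem + base + (reg3 << 1), sizeof value); /* 16-bit load */
        reg4 += value;
    }
    return reg4;
}

/* Form after array recovery: the stride-2 pattern becomes array[reg3]. */
long sum_by_index(const short array[10]) {
    long reg4 = 0;
    for (long reg3 = 0; reg3 < 10; reg3++) {
        reg4 += array[reg3];
    }
    return reg4;
}
```

Both functions read the same ten 16-bit values, which is why a linear address pattern with a shift by 1 can be rewritten as indexing into a `short` array.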
37
Comparison of Decompiled Code and Original Code

Decompiled code almost identical to original code
Only difference is variable names
Binary synthesis is competitive with high-level synthesis

Original C Code
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}

Decompiled Code
long f( short array[10] ) {
  long reg4 = 0;
  for (long reg3 = 0; reg3 < 10; reg3++) {
    reg4 += array[reg3];
  }
  return reg4;
}

Almost Identical Representations
38
Binary Synthesis Tool Flow
[Figure: tool flow; labels: Libraries/Object Code, uP, FPGA]
30,000 lines of C code
39
Binary Synthesis is Competitive with High-Level Synthesis
Small difference in speedup

Binary synthesis competitive with high-level synthesis
Binary speedup: 8x, High-level speedup: 8.2x
High-level synthesis only 2.5% better
Commercial products beginning to appear
Critical Blue, Binachip

40
Binary Synthesis with Software Compiler Optimizations

But, binaries generated with few optimizations
Optimizations for software may hurt hardware
Need new decompilation techniques

Hardware synthesized from optimized binary may be inefficient
[Figure: C code -> SW Compiler -> Optimized Binary -> Binary Synthesis -> uP and FPGA]
41
Loop Rerolling

Problem: Loop unrolling may cause inefficient hardware
Longer synthesis times
Super-linear heuristics
Unrolling 100 times => synthesis time is 100^2 times longer
Larger area requirements
Unrolling by compiler unlikely to match unrolling by synthesis
Loop structure needed for advanced synthesis techniques

Non-unrolled Loop
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5

Unrolled Loop
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1

Solution: We introduce loop rerolling to undo loop unrolling
42
Loop Rerolling: Identifying Unrolled Loops

Idea: Identify consecutively repeating instruction sequences

Original C Code
x = x + 1;
for (i=0; i < 2; i++)
  a[i] = b[i] + 1;
y = x;

Find consecutive repeating substrings: adjacent nodes with same substring
43
Loop Rerolling: Unrolled Loop Identification

Original C Code
x = x + 1;
for (i=0; i < 2; i++)
  a[i] = b[i] + 1;
y = x;

Corresponding unrolled assembly
Add r3, r3, 1
Ld r0, b(0)
Add r1, r0, 1
St a(0), r1
Ld r0, b(1)
Add r1, r0, 1
St a(1), r1
Mov r4, r3

Average Speedup of 1.6x
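
The identification idea can be sketched as a search for a consecutively repeating substring in the opcode stream. This is a brute-force stand-in, not the method from the talk: it compares opcodes only, ignores operands, and encodes each instruction as one character (A = Add, L = Ld, S = St, M = Mov). The `Repeat` type and `find_unrolled_body` function are hypothetical names.

```c
#include <string.h>

/* A run of `count` back-to-back copies of a `period`-long opcode
   sequence, beginning at index `start`. */
typedef struct { int start; int period; int count; } Repeat;

/* Finds the repeat covering the most instructions (period * count).
   Returns count == 0 if nothing repeats at least twice. */
Repeat find_unrolled_body(const char *ops) {
    int n = (int)strlen(ops);
    Repeat best = {0, 0, 0};

    for (int start = 0; start < n; start++) {
        for (int period = 1; start + 2 * period <= n; period++) {
            int count = 1;
            /* Extend while the next block matches the first block. */
            while (start + (count + 1) * period <= n &&
                   memcmp(ops + start, ops + start + count * period,
                          (size_t)period) == 0) {
                count++;
            }
            if (count >= 2 && period * count > best.period * best.count) {
                best.start = start;
                best.period = period;
                best.count = count;
            }
        }
    }
    return best;
}
```

For the slide's assembly, encoded as `"ALASLASM"`, the search reports the three-instruction body Ld, Add, St repeated twice starting at the second instruction, which corresponds to the two iterations of the original loop.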
44
Strength Promotion

Problem: Strength reduction may cause inefficient hardware

However, some of the strength reduction was beneficial
Strength promotion lets synthesis, not the software compiler, decide on strength reduction
Average Speedup of 1.5x
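
A minimal sketch of the idea, assuming the software compiler replaced a multiplication by a constant with shifts and adds of the same operand. The function names and the choice of the constant 10 are illustrative and do not come from the talk.

```c
/* Software-compiler output after strength reduction:
   x * 10 expressed as shifts and adds. */
unsigned times10_reduced(unsigned x) {
    return (x << 3) + (x << 1);
}

/* Strength promotion: fold a sum of shifts of one operand back into
   the multiplier constant it represents, the sum of 2^shift. */
unsigned promoted_constant(const unsigned *shifts, int n) {
    unsigned c = 0;
    for (int i = 0; i < n; i++) {
        c += 1u << shifts[i];
    }
    return c;
}
```

With the multiplication restored as `x * 10`, the synthesis tool is free to choose a multiplier or its own shift-add network, depending on what suits the hardware.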
45
Multiple ISA/Optimization Results

What about aggressive software compiler
optimizations?
May obscure binary, making decompilation
impossible
What about different instruction sets?
Side effects may degrade hardware performance

[Figure: speedup chart]
46
High-level vs. Binary Synthesis: Proprietary H.264 Decoder

High-level synthesis vs. binary synthesis
Collaboration with Freescale Semiconductor
H.264 Decoder
MPEG-4 Part 10 Advanced Video Coding (AVC)
3x smaller than MPEG-2
Better quality

[Figure: H.264 vs. MPEG-2]
47
High-level vs. Binary Synthesis: Proprietary H.264 Decoder

Binary synthesis was competitive with high-level
synthesis
High-level speedup 6.56x
Binary speedup 6.55x

48
Outline

Introduction
Warp Processing Overview
Enabling Technology: Binary Synthesis
Key techniques for synthesis from binaries
Decompilation
Current and Future Directions
Multi-Threaded Warp Processing
Custom Communication

49
Thread Warping - Overview
Architectural Trend: Include more cores on chip
Result: More multi-threaded applications
OS schedules 4 threads to custom accelerators
Warp tools create custom accelerators for b( )
Remaining 8 threads placed in thread queue
3x more thread parallelism
[Figure: four µPs, Warp FPGA, Profiler, Warp Tools, OS]
50
Thread Warping - Overview
[Figure: four µPs, Warp FPGA, Profiler, Warp Tools, OS]
51
Thread Warping - Results
Thread warping 120x faster than 4-uP (ARM) system

Comparison of thread warping (TW) and multi-core
Simulated multi-cores ranging from 4 to 64 cores
Thread warping: 4 cores + FPGA

52
Warp Processing: Custom Communication
NoC (Network on a Chip) provides communication between multiple cores [Benini, DeMicheli] [Hemani] [Kumar]
Problem: Best topology is application dependent
[Figure: performance of Bus vs. Mesh topologies connecting µPs, shown for App1 and App2]
53
Warp Processing: Custom Communication
Warp processing can dynamically choose topology
2x to 100x improvement
Collaboration with Rakesh Kumar, University of Illinois, Urbana-Champaign: "Amoebic Computing"
[Figure: FPGA-based communication; performance of Bus vs. Mesh, shown for App1 and App2]
54
Summary
55
References

Patent
Warp Processor for Dynamic Hardware/Software Partitioning. F. Vahid, R. Lysecky, and G. Stitt. Patent Pending, 2004.
Hardware/Software Partitioning of Software Binaries. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2002, pp. 164-170.
Warp Processors. R. Lysecky, G. Stitt, and F. Vahid. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2006, Volume 11, Number 3, pp. 659-681.
Binary Synthesis. G. Stitt and F. Vahid. Accepted for publication in ACM Transactions on Design Automation of Electronic Systems (TODAES).
Expandable Logic. G. Stitt and F. Vahid. Submitted to IEEE/ACM Conference on Design Automation (DAC), 2007.
New Decompilation Techniques for Binary-level Co-processor Generation. G. Stitt and F. Vahid. IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2005, pp. 547-554.
Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, and B. Einloth. IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005, pp. 285-290.
A Decompilation Approach to Partitioning Software for Microprocessor/FPGA Platforms. G. Stitt and F. Vahid. IEEE/ACM Design Automation and Test in Europe (DATE), 2005, pp. 396-397.
Dynamic Hardware/Software Partitioning: A First Approach. G. Stitt, R. Lysecky, and F. Vahid. IEEE/ACM Conference on Design Automation (DAC), 2003, pp. 250-255.

Supported by NSF, SRC, Intel, IBM, Xilinx
