Warp Processing: Making FPGAs Ubiquitous via Invisible Synthesis

1
Warp Processing: Making FPGAs Ubiquitous via
Invisible Synthesis
  • Greg Stitt
  • Department of Electrical and Computer Engineering
  • University of Florida

2
Introduction
  • Improved performance enables new applications
  • Past decade - MP3 players, portable game
    consoles, cell phones, etc.
  • Future architectures - Speech/image recognition,
    self-guiding cars, computational biology, etc.

3
Introduction
  • FPGAs (Field Programmable Gate Arrays):
    implement custom circuits
  • Speedups of 10x, 100x, even 1000x for scientific and embedded
    apps
  • [Najjar 04] [He, Lu, Sun 05] [Levine, Schmit
    03] [Prasanna 06] [Stitt, Vahid 05], ...
  • But, FPGAs not mainstream
  • Warp Processing Goal: Bring FPGAs into mainstream
  • Make FPGAs "invisible"

[Figure: performance of uP vs. FPGA. Caption: FPGAs capable of large performance improvements]
4
Introduction: Hardware/Software Partitioning
C Code for FIR Filter
    for (i=0; i < 16; i++)
        y[i] += c[i] * x[i];
    ...
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i];
    ...
  • 1000 cycles
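
To make the partitioning candidate concrete, the loop on this slide can be written as a standalone C function. This is only a sketch: the operators in the slide's loop body were lost in extraction, so the body `y[i] += c[i] * x[i]` is an assumption, as are the function name `fir_accumulate`, the `int` element type, and the explicit length parameter.

```c
#include <stddef.h>

/* Hypothetical kernel. The array names y, c, x follow the slide;
   the element type and the length parameter n are assumptions. */
void fir_accumulate(int *y, const int *c, const int *x, size_t n)
{
    /* Each iteration is independent of the others, which is what
       makes a loop like this attractive to implement as a circuit. */
    for (size_t i = 0; i < n; i++) {
        y[i] += c[i] * x[i];
    }
}
```

A caller would pass `n = 16` or `n = 128` to match the two loops shown on the slide.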

5
Introduction: High-level Synthesis
  • Problem: Describing circuit using HDL is time
    consuming/difficult
  • Solution: High-level synthesis
  • Create circuit from high-level code
  • [Gupta, DeMicheli 92] [Camposano, Wolf 91] [Rabaey
    96] [Gajski, Dutt 92]
  • Allows developers to use higher-level
    specification
  • Potentially, enables synthesis for software
    developers

    for (i=0; i < 16; i++) y[i] += c[i] * x[i];
8
Outline
  • Introduction
  • Warp Processing Overview
  • Enabling Technology: Binary Synthesis
  • Key techniques for synthesis from binaries
  • Decompilation
  • Current and Future Directions
  • Multi-threaded Warp Processing
  • Custom Communication

9
Problems with High-Level Synthesis
  • Problem: High-level synthesis is unattractive to
    software developers
  • Requires specialized language
  • SystemC, NapaC, HandelC, ...
  • Requires specialized compiler
  • Spark, ROCCC, CatapultC, ...
  • Limited commercial success
  • Software developers reluctant to change tools

[Figure: uP and FPGA]
10
Warp Processing: Invisible Synthesis
  • Solution: Make synthesis invisible
  • 2 Requirements:
  • Standard software tool flow
  • Perform compilation before synthesis
  • Hide synthesis tool
  • Move synthesis on chip
  • Similar to dynamic binary translation
  • Transmeta
  • But, translate to hw

Warp processor looks like standard uP but
invisibly synthesizes hardware
[Figure: uP and FPGA]
12
Warp Processing: Invisible Synthesis
  • Advantages
  • Supports all languages, compilers, IDEs
  • Supports synthesis of assembly code
  • Supports synthesis of library code
  • Also, enables dynamic optimizations
13
Warp Processing Background: Basic Idea
Step 1: Initially, software binary loaded into
instruction memory

[Figure, repeated on slides 13-20: warp processor block diagram with Profiler, I Mem, µP, D, FPGA, On-chip CAD]

14
Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software
binary

15
Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects
critical regions in binary

[Figure: stream of beq and add instructions passing from the µP to the Profiler]

16
Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region

17
Warp Processing Background: Basic Idea
Step 5: On-chip CAD converts critical region into control
data flow graph (CDFG)

[Figure label: Dynamic Part. Module (DPM)]

18
Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a
custom (parallel) circuit

19
Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA

20
Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to
use hardware, causing performance and energy to
"warp" by an order of magnitude or more
    Mov reg3, 0
    Mov reg4, 0
    loop:
    // instructions that interact with FPGA
    Ret reg4
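
Step 3 of the flow above (the profiler detecting critical regions) can be illustrated with a small sketch. The slides do not say how the profiler works, so the approach here, counting taken backward branches per target address and picking the most frequent target as the hot loop, is an assumption, and every name (`profiler_t`, `profiler_observe`, `profiler_hottest`, `MAX_TARGETS`) is hypothetical.

```c
#include <stddef.h>

#define MAX_TARGETS 16  /* hypothetical table size */

typedef struct {
    unsigned target[MAX_TARGETS];  /* loop-head addresses seen so far */
    unsigned count[MAX_TARGETS];   /* taken backward branches per head */
    size_t used;                   /* number of table entries in use */
} profiler_t;

/* Record one taken branch. Only backward branches (target <= pc) are
   counted, on the assumption that those usually close loops. */
void profiler_observe(profiler_t *p, unsigned pc, unsigned target)
{
    if (target > pc) return;
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < MAX_TARGETS) {
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
}

/* Return the most frequently hit loop head, or 0 if none was recorded. */
unsigned profiler_hottest(const profiler_t *p)
{
    unsigned best = 0, best_count = 0;
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

The small fixed-size table reflects the constraint that on-chip components must be cheap; once the table is full, new loop heads are simply ignored in this sketch.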


21
Expandable Logic
[Figure: uP with RAM and Expandable RAM, plus Expandable Logic; performance plot]
22
Expandable Logic
  • Allows for customization of platforms
  • User can select FPGAs based on used applications

[Figure: performance per application; example: Portable Gaming]
24
Expandable Logic
  • Allows for customization of platforms
  • User can select FPGAs based on used applications

[Figure: performance per application; examples: Web Browser, No-FPGA]
  • Platform designer doesn't have to decide on fixed
    amount of FPGA
  • User doesn't have to pay for FPGA that isn't
    needed

25
Warp Processing Background: Basic Technology
  • Challenge: CAD tools normally require powerful
    workstations
  • Develop extremely efficient on-chip CAD tools
  • Requires efficient synthesis
  • Requires specialized FPGA, physical design tools
    (JIT FPGA compilation)
  • [Lysecky FCCM05/DAC04], University of
    Arizona
26
Warp Processing Background: On-Chip CAD
[Figure: on-chip CAD flow with stages labeled Synthesis, RT Syn., Log. Opt., Tech. Map, Place, Route; other labels: "9.1 s", "Xilinx ISE"]
27
Warp Processing: Initial Results
- Embedded Applications
  • Average speedup of 6.3x
  • Achieved completely transparently
  • Also, energy savings of 66%

28
Outline
  • Introduction
  • Warp Processing Overview
  • Enabling Technology: Binary Synthesis
  • Key techniques for synthesis from binaries
  • Decompilation
  • Current and Future Directions
  • Multi-threaded Warp Processing
  • Custom Communication

29
Binary Synthesis
    for (i=0; i < 128; i++)
        y[i] += c[i] * x[i];
    ...
  • Warp processors perform synthesis from software
    binary: "binary synthesis"
  • Problem: No high-level information
  • Synthesis needs high-level constructs
  • > 10x slowdown
  • Can we recover high-level information for
    synthesis?
  • Make binary synthesis (and Warp processing)
    competitive with high-level synthesis

    Addi r1, r0, 0
    Ld r3, 256(r1)
    Ld r4, 512(r1)
    Subi r2, r1, 128
    Jnz r2, -5
No high-level constructs: arrays, loops, etc.
Hardware can be > 10x to 100x slower
30
Decompilation
  • We realized decompilation recovers high-level
    information
  • But, generally used for binary translation or
    source-code recovery
  • May not be suitable for synthesis
  • We studied existing approaches
  • [Cifuentes 94, 99, 01] [Mycroft 99, 01]
  • DisC, dcc, Boomerang, Mocha, SourceAgain
  • Determined relevant techniques
  • Adapted existing techniques for synthesis

31
Decompilation: Control/Data Flow Graph Recovery
  • Recovery of control/data flow graph (CDFG)
  • Format used by synthesis
  • Difficult because of indirect jumps
  • Cannot statically analyze control flow
  • But, heuristics are over 99% successful on
    standard benchmarks
  • [Cifuentes 99, 00]

Original C Code
    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }
Corresponding Assembly
    Mov reg3, 0
    Mov reg4, 0
    loop:
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
    Ret reg4
Control/Data Flow Graph Creation
    reg3 := 0
    reg4 := 0
    loop:
    reg1 := reg3 << 1
    reg5 := reg2 + reg1
    reg6 := mem[reg5 + 0]
    reg4 := reg4 + reg6
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
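
The first step of building a control flow graph from a binary is usually to split the instruction stream into basic blocks. The following is a minimal sketch of that step for direct branches only, not the tool's actual implementation; the `insn_t` layout and the function name `mark_leaders` are hypothetical, and the branch offset is assumed to be relative to the branch instruction itself, as in `Beq reg3, 10, -5` on the slide. Indirect jumps, which the slide identifies as the hard case, are not handled.

```c
#include <stddef.h>

typedef struct {
    int is_branch;  /* nonzero for a direct conditional branch */
    int offset;     /* target relative to this instruction's index */
} insn_t;

/* Mark basic-block leaders: the first instruction, every branch target,
   and every instruction that directly follows a branch. */
void mark_leaders(const insn_t *code, size_t n, int *leader)
{
    for (size_t i = 0; i < n; i++) leader[i] = 0;
    if (n > 0) leader[0] = 1;
    for (size_t i = 0; i < n; i++) {
        if (!code[i].is_branch) continue;
        long target = (long)i + code[i].offset;
        if (target >= 0 && (size_t)target < n) leader[target] = 1;
        if (i + 1 < n) leader[i + 1] = 1;
    }
}
```

For the slide's nine-instruction example, with the branch at index 7 and offset -5, the leaders are index 0 (function entry), index 2 (the `Shl` at the loop head), and index 8 (the `Ret`), which matches the three blocks of the recovered graph.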
32
Decompilation: Data Flow Analysis
  • Original purpose - remove temporary registers
  • Area overhead: 130%
  • Need new techniques for binary synthesis

Data Flow Analysis (same original C code and assembly as on the CDFG recovery slide)
    reg3 := 0
    reg4 := 0
    loop:
    reg4 := reg4 + mem[reg2 + (reg3 << 1)]
    reg3 := reg3 + 1
    if (reg3 < 10) goto loop
    ret reg4
33
Decompilation: Data Flow Analysis
  • Strength reduction: compare-with-zero
    instructions
  • Operator size reduction

Compare-with-zero example (32-bit reg4, 32-bit reg5)
    Sub reg3, reg4, reg5
    Bz reg3, -5
Operator size example (32-bit operands, 32-bit reg3)
    Lb reg4, 0(reg1)
    Mvi reg5, 16
    Add reg3, reg4, reg5
Area Overhead Reduced to 10%
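
The operator size example can be worked through numerically. The sketch below assumes unsigned values for simplicity (a byte load may well be sign-extending on the real target), and both function names are hypothetical. It shows why the 32-bit `Add` in the example only needs a 9-bit adder: one operand comes from a byte load (8 bits) and the other is the constant 16 (5 bits).

```c
/* Number of bits needed to represent an unsigned value (at least 1). */
unsigned bits_for_value(unsigned v)
{
    unsigned bits = 1;
    while (v >>= 1) bits++;
    return bits;
}

/* Width of an unsigned adder result for operands of the given widths:
   the wider operand plus one bit for the carry out. */
unsigned adder_width(unsigned a_bits, unsigned b_bits)
{
    return (a_bits > b_bits ? a_bits : b_bits) + 1;
}
```

With these definitions, `adder_width(8, bits_for_value(16))` evaluates to 9, so under the unsigned assumption a synthesis tool that knows the operand sizes can replace a 32-bit adder with a 9-bit one.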
34
Decompilation: Function Recovery
  • Recover parameters and return values
  • Def-use analysis of prologue/epilogue
  • 100% success rate

Function Recovery (same original C code and assembly as on the CDFG recovery slide)
    long f( long reg2 ) {
        int reg3 = 0;
        int reg4 = 0;
    loop:
        reg4 = reg4 + mem[reg2 + (reg3 << 1)];
        reg3 = reg3 + 1;
        if (reg3 < 10) goto loop;
        return reg4;
    }
35
Decompilation: Control Structure Recovery
  • Recover loops, if statements
  • Uses interval analysis techniques
  • [Cifuentes 94]
  • 100% success rate

Control Structure Recovery (same original C code and assembly as on the CDFG recovery slide)
    long f( long reg2 ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += mem[reg2 + (reg3 << 1)];
        }
        return reg4;
    }
36
Decompilation: Array Recovery
  • Detect linear memory patterns and row-major
    ordering calculations
  • 95% success rate
  • [Stitt, Guo, Najjar, Vahid 05]
  • [Cifuentes 00]

Array Recovery (same original C code and assembly as on the CDFG recovery slide)
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }
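
The reason array recovery is valid here can be checked directly: a load from base address plus `(index << 1)` touches the same bytes as indexing an array of 2-byte elements. The sketch below compares the two views. The function names are hypothetical, and it assumes `short` is 16 bits, which is what the shift by 1 in the assembly implies.

```c
#include <stdint.h>
#include <string.h>

/* Binary-level view: base address plus (index << 1), 2-byte loads. */
long sum_by_address(const unsigned char *mem, size_t base)
{
    long acc = 0;
    for (long i = 0; i < 10; i++) {
        int16_t v;
        memcpy(&v, mem + base + ((size_t)i << 1), sizeof v);
        acc += v;
    }
    return acc;
}

/* Recovered view: the same accesses expressed as array indexing. */
long sum_by_index(const int16_t array[10])
{
    long acc = 0;
    for (long i = 0; i < 10; i++) {
        acc += array[i];
    }
    return acc;
}
```

Both functions return the same sum for the same ten elements. The indexed form is the one a synthesis tool can analyze for memory access patterns.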
37
Comparison of Decompiled Code and Original Code
  • Decompiled code almost identical to original code
  • Only difference is variable names
  • Binary synthesis is competitive with high-level
    synthesis

Original C Code
    long f( short a[10] ) {
        long accum = 0;
        for (int i=0; i < 10; i++) {
            accum += a[i];
        }
        return accum;
    }
Decompiled Code
    long f( short array[10] ) {
        long reg4 = 0;
        for (long reg3 = 0; reg3 < 10; reg3++) {
            reg4 += array[reg3];
        }
        return reg4;
    }
Almost Identical Representations
38
Binary Synthesis Tool Flow
[Figure: tool flow from Libraries/Object Code to uP and FPGA; label: 30,000 lines of C code]
39
Binary Synthesis is Competitive with High-Level
Synthesis
  • Binary synthesis competitive with high-level
    synthesis
  • Binary speedup 8x, High-level speedup 8.2x
  • High-level synthesis only 2.5% better
  • Commercial products beginning to appear
  • Critical Blue, Binachip

[Figure: speedup comparison. Caption: Small difference in speedup]

40
Binary Synthesis with Software Compiler
Optimizations
  • But, binaries generated with few optimizations
  • Optimizations for software may hurt hardware
  • Need new decompilation techniques

Hardware synthesized from optimized binary may be
inefficient
[Figure: C code -> SW Compiler -> Optimized Binary -> Binary Synthesis -> uP / FPGA]
41
Loop Rerolling
  • Problem: Loop unrolling may cause inefficient
    hardware
  • Longer synthesis times
  • Super-linear heuristics
  • Unrolling 100 times => synthesis time is 100^2
    times longer
  • Larger area requirements
  • Unrolling by compiler unlikely to match unrolling
    by synthesis
  • Loop structure needed for advanced synthesis
    techniques

Non-unrolled Loop
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Beq reg3, 10, -5
Unrolled Loop
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
    Shl reg1, reg3, 1
    Add reg5, reg2, reg1
    Ld reg6, 0(reg5)
    Add reg4, reg4, reg6
    Add reg3, reg3, 1
Solution: We introduce loop rerolling to undo
loop unrolling
42
Loop Rerolling: Identifying Unrolled Loops
  • Idea - Identify consecutively repeating
    instruction sequences

Original C Code
    x = x + 1;
    for (i=0; i < 2; i++)
        a[i] = b[i] + 1;
    y = x;
Find consecutive repeating substrings: adjacent
nodes with same substring
43
Loop Rerolling
Original C Code
    x = x + 1;
    for (i=0; i < 2; i++)
        a[i] = b[i] + 1;
    y = x;
Unrolled Loop Identification
    Add r3, r3, 1
    Ld r0, b(0)
    Add r1, r0, 1
    St a(0), r1
    Ld r0, b(1)
    Add r1, r0, 1
    St a(1), r1
    Mov r4, r3
Average Speedup of 1.6x
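
The idea of finding consecutively repeating instruction sequences can be sketched with a brute-force scan. This is a stand-in for illustration, not the string-matching method the slides use, and it compares opcodes only; a real rerolling pass would also have to check that the operands (such as `b(0)` and `b(1)`) follow an induction pattern. The type `repeat_t` and the function `find_unrolled` are hypothetical names.

```c
#include <stddef.h>

typedef struct {
    size_t start;  /* index of the first instruction of the first copy */
    size_t len;    /* instructions per copy */
    size_t reps;   /* number of consecutive copies */
} repeat_t;

/* Find the block whose consecutive repetitions cover the most
   instructions. ops[] holds one opcode id per instruction.
   Returns reps == 0 if nothing repeats. */
repeat_t find_unrolled(const int *ops, size_t n)
{
    repeat_t best = {0, 0, 0};
    for (size_t start = 0; start < n; start++) {
        for (size_t len = 1; start + 2 * len <= n; len++) {
            size_t reps = 1;
            /* Extend while the next len opcodes match the first block. */
            while (start + (reps + 1) * len <= n) {
                size_t k = 0;
                while (k < len &&
                       ops[start + k] == ops[start + reps * len + k]) {
                    k++;
                }
                if (k < len) break;
                reps++;
            }
            if (reps >= 2 && reps * len > best.reps * best.len) {
                best.start = start;
                best.len = len;
                best.reps = reps;
            }
        }
    }
    return best;
}
```

Encoding the slide's eight instructions as opcode ids (Add, Ld, Add, St, Ld, Add, St, Mov), the scan reports a 3-instruction body starting at index 1 and repeated twice, which corresponds to the `Ld`/`Add`/`St` group of the original two-iteration loop.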
44
Strength Promotion
  • Problem: Strength reduction may cause inefficient
    hardware

However, some of the strength reduction was
beneficial
Strength promotion lets synthesis decide on
strength reduction, not software compiler
Average Speedup of 1.5
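
The slide does not spell out the transformation, so the following is an assumption about what strength promotion involves: recognizing a sum of shifted copies of one value, which a software compiler may emit in place of a multiplication, and recovering the original constant multiplier so the synthesis tool can choose its own implementation. Both function names are hypothetical.

```c
#include <stddef.h>

/* What a software compiler might emit for x * 10 after strength
   reduction: two shifts and an add instead of a multiply. */
unsigned times10_reduced(unsigned x)
{
    return (x << 3) + (x << 1);
}

/* Promotion: a sum of shifted copies of one value is a multiply by
   the sum of the matching powers of two. */
unsigned promoted_multiplier(const unsigned *shifts, size_t n)
{
    unsigned m = 0;
    for (size_t i = 0; i < n; i++) {
        m += 1u << shifts[i];
    }
    return m;
}
```

For the shift amounts 3 and 1, `promoted_multiplier` returns 10, so the shift-add sequence can be presented to synthesis as `x * 10`.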
45
Multiple ISA/Optimization Results
  • What about aggressive software compiler
    optimizations?
  • May obscure binary, making decompilation
    impossible
  • What about different instruction sets?
  • Side effects may degrade hardware performance

[Figure: speedup results]
46
High-level vs. Binary Synthesis: Proprietary
H.264 Decoder
  • High-level synthesis vs. binary synthesis
  • Collaboration with Freescale Semiconductor
  • H.264 Decoder
  • MPEG-4 Part 10 Advanced Video Coding (AVC)
  • 3x smaller than MPEG-2
  • Better quality

[Figure: H.264 vs. MPEG2]
47
High-level vs. Binary Synthesis: Proprietary
H.264 Decoder
  • Binary synthesis was competitive with high-level
    synthesis
  • High-level speedup 6.56x
  • Binary speedup 6.55x

48
Outline
  • Introduction
  • Warp Processing Overview
  • Enabling Technology: Binary Synthesis
  • Key techniques for synthesis from binaries
  • Decompilation
  • Current and Future Directions
  • Multi-Threaded Warp Processing
  • Custom Communication

49
Thread Warping - Overview
Architectural Trend: Include more cores on chip
Result: More multi-threaded applications
OS schedules 4 threads to custom accelerators
Warp tools create custom accelerators for b( )
Remaining 8 threads placed in thread queue
3x more thread parallelism
[Figure: four µPs with OS, Warp Tools, Profiler, and Warp FPGA]
51
Thread Warping - Results
Thread warping 120x faster than 4-uP (ARM) system
  • Comparison of thread warping (TW) and multi-core
  • Simulated multi-cores ranging from 4 to 64
  • Thread warping: 4 cores + FPGA

52
Warp Processing: Custom Communication
NoC (Network on a Chip) provides communication
between multiple cores [Benini,
DeMicheli] [Hemani] [Kumar]
Problem: Best topology is application dependent
[Figure: performance of Bus vs. Mesh topologies for App1 and App2 on a multi-µP system]
53
Warp Processing: Custom Communication
[Figure: performance of Bus vs. Mesh topologies for App1 and App2, with FPGA]
Warp processing can dynamically choose topology
2x to 100x improvement
Collaboration with Rakesh Kumar, University of
Illinois, Urbana-Champaign: "Amoebic Computing"
54
Summary
55
References
  • Patent
  • Warp Processor for Dynamic Hardware/Software
    Partitioning. F. Vahid, R. Lysecky, G. Stitt.
    Patent Pending, 2004.
  • Hardware/Software Partitioning of Software
    Binaries. G. Stitt and F. Vahid. IEEE/ACM
    International Conference on Computer Aided Design
    (ICCAD), 2002, pp. 164-170.
  • Warp Processors. R. Lysecky, G. Stitt, and F.
    Vahid. ACM Transactions on Design Automation of
    Electronic Systems (TODAES), 2006, Volume 11,
    Number 3, pp. 659-681.
  • Binary Synthesis. G. Stitt and F. Vahid. Accepted
    for publication in ACM Transactions on Design
    Automation of Electronic Systems (TODAES).
  • Expandable Logic. G. Stitt and F. Vahid. Submitted
    to IEEE/ACM Conference on Design Automation
    (DAC), 2007.
  • New Decompilation Techniques for Binary-level
    Co-processor Generation. G. Stitt and F. Vahid.
    IEEE/ACM International Conference on Computer
    Aided Design (ICCAD), 2005, pp. 547-554.
  • Hardware/Software Partitioning of Software
    Binaries: A Case Study of H.264 Decode. G. Stitt,
    F. Vahid, G. McGregor, B. Einloth. IEEE/ACM/IFIP
    International Conference on Hardware/Software
    Codesign and System Synthesis (CODES/ISSS), 2005,
    pp. 285-290.
  • A Decompilation Approach to Partitioning Software
    for Microprocessor/FPGA Platforms. G. Stitt and
    F. Vahid. IEEE/ACM Design Automation and Test in
    Europe (DATE), 2005, pp. 396-397.
  • Dynamic Hardware/Software Partitioning: A First
    Approach. G. Stitt, R. Lysecky, and F. Vahid.
    IEEE/ACM Conference on Design Automation (DAC),
    2003, pp. 250-255.

Supported by NSF, SRC, Intel, IBM, Xilinx