Warp Processors


Provided by: FrankV156

Learn more at: http://www.cs.ucr.edu


1
Warp Processors

Frank Vahid (Task Leader)
Department of Computer Science and Engineering
University of California, Riverside
Associate Director, Center for Embedded Computer Systems, UC Irvine
Task ID 1331.001, July 2005 to June 2008
Ph.D. students
Greg Stitt Ph.D. expected June 2006
Ann Gordon-Ross Ph.D. expected June 2006
David Sheldon Ph.D. expected 2009
Ryan Mannion Ph.D. expected 2009
Scott Sirowy Ph.D. expected 2010
Industrial Liaisons
Brian W. Einloth, Motorola
Serge Rutman, Dave Clark, Intel
Jeff Welser, IBM

2
Task Description

Warp processing background
Two seed SRC CSR grants (2002-2005) showed
feasibility
Idea: Transparently move critical binary regions from microprocessor to FPGA → 10x perf./energy gains or more
Task: Mature warp technology
Years 1/2 (in progress)
Automatic high-level construct recovery from
binaries
In-depth case studies (with Freescale)
Also discovered unanticipated problem, developed
solution
Warp-tailored FPGA prototype (with Intel)
Years 2/3
Reduce memory bottleneck by using smart buffer
Investigate domain-specific-FPGA concepts (with
Freescale)
Consider desktop/server domains (with IBM)

3
Warp Processing Background: Basic Idea
Step 1: Initially, software binary loaded into instruction memory
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
4
Warp Processing Background: Basic Idea
Step 2: Microprocessor executes instructions in software binary
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
5
Warp Processing Background: Basic Idea
Step 3: Profiler monitors instructions and detects critical regions in binary
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD; instruction stream of repeated beq and add instructions]
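
The profiling step can be illustrated in software. The sketch below is a hypothetical model, not the on-chip hardware from the cited profiler work: it assumes that a critical region is a loop, that loops show up as taken backward branches, and that a small fixed-size table of counters is enough. The names `profiler_t`, `profiler_observe`, `profiler_hottest`, and the table size are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PROFILER_ENTRIES 16  /* assumed table size, chosen for illustration */

typedef struct {
    uint32_t target[PROFILER_ENTRIES]; /* loop-start address (branch target) */
    uint32_t count[PROFILER_ENTRIES];  /* taken backward branches to that target */
    size_t used;                       /* number of occupied entries */
} profiler_t;

void profiler_init(profiler_t *p) {
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

/* Record one taken branch. Only backward branches (target <= pc) are
   treated as loop back-edges; forward branches are ignored. */
void profiler_observe(profiler_t *p, uint32_t pc, uint32_t target) {
    assert(p != NULL);
    if (target > pc) return;
    for (size_t i = 0; i < p->used; i++) {
        if (p->target[i] == target) {
            p->count[i]++;
            return;
        }
    }
    if (p->used < PROFILER_ENTRIES) {
        p->target[p->used] = target;
        p->count[p->used] = 1;
        p->used++;
    }
    /* Table full: the new target is dropped in this simplified model. */
}

/* Return the most frequently seen loop-start address, or 0 if none. */
uint32_t profiler_hottest(const profiler_t *p) {
    uint32_t best = 0, best_count = 0;
    assert(p != NULL);
    for (size_t i = 0; i < p->used; i++) {
        if (p->count[i] > best_count) {
            best_count = p->count[i];
            best = p->target[i];
        }
    }
    return best;
}
```

Feeding the model one `profiler_observe` call per taken branch and then calling `profiler_hottest` yields the loop-start address that would be handed to the on-chip CAD as the critical-region candidate.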
6
Warp Processing Background: Basic Idea
Step 4: On-chip CAD reads in critical region
[Figure: Profiler, I Mem, µP, D, FPGA, On-chip CAD]
7
Warp Processing Background: Basic Idea
Step 5: On-chip CAD decompiles critical region into control data flow graph (CDFG)
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
8
Warp Processing Background: Basic Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a custom (parallel) circuit
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]
9
Warp Processing Background: Basic Idea
Step 7: On-chip CAD maps circuit onto FPGA
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]

10
Warp Processing Background: Basic Idea
Step 8: On-chip CAD replaces instructions in binary to use hardware, causing performance and energy to "warp" by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop:
// instructions that interact with FPGA
Ret reg4
[Figure: Profiler, I Mem, µP, D, FPGA, Dynamic Part. Module (DPM), On-chip CAD]

11
Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms

FPGAs with hard core processors
FPGAs with soft core processors
Computer boards with FPGAs

Xilinx Virtex II Pro. Source: Xilinx
Altera Excalibur. Source: Altera
Xilinx Spartan. Source: Xilinx
Cray XD1. Source: FPGA Journal, Apr. 2005
12
Warp Processing Background: Trend Towards Processor/FPGA Programmable Platforms

Programming a key challenge
Solution 1: Compile high-level language to custom binaries
Solution 2: Use standard binaries, dynamically re-map (warp)
Cons
Less high-level information, less optimization
Pros
Available to all software developers, not just specialists
Data-dependent optimization
Most importantly, standard binaries enable an "ecosystem" among tools, architecture, and applications
Most significant concept presently absent in FPGAs and other new programmable platforms
13
Warp Processing Background: Basic Technology

Warp processing
On-chip profiler
Warp-tuned FPGA
On-chip CAD, including Just-in-Time FPGA
compilation

JIT FPGA compilation
14
Warp Processing Background: Initial Results
[Figure labels: Tech. Map, Decomp., Partitioning, Log. Syn., RT Syn., Place, Route; Xilinx ISE: 9.1 s]
15
Warp Processing Background: Publications 2002-2005

On-chip profiler
Frequent Loop Detection Using Efficient
Non-Intrusive On-Chip Hardware, A. Gordon-Ross
and F. Vahid, ACM/IEEE Conf. on Compilers,
Architecture and Synthesis for Embedded Systems
(CASES), 2003
Extended version of above in special issue Best
of CASES/MICRO of IEEE Trans. on Comp., Oct
2005.
Warp-tuned FPGA
A Configurable Logic Architecture for Dynamic
Hardware/Software Partitioning, R. Lysecky and F.
Vahid, Design Automation and Test in Europe Conf.
(DATE), Feb 2004.
On-chip CAD, including Just-in-Time FPGA
compilation
A Study of the Scalability of On-Chip Routing for
Just-in-Time FPGA Compilation. R. Lysecky, F.
Vahid and S. Tan. IEEE Symp. on
Field-Programmable Custom Computing Machines
(FCCM), 2005.
A Study of the Speedups and Competitiveness of
FPGA Soft Processor Cores using Dynamic
Hardware/Software Partitioning. R. Lysecky and F.
Vahid. Design Automation and Test in Europe
(DATE), March 2005.
Dynamic FPGA Routing for Just-in-Time FPGA
Compilation. R. Lysecky, F. Vahid, and S. Tan.
Design Automation Conf. (DAC), June 2004.
A Codesigned On-Chip Logic Minimizer, R. Lysecky
and F. Vahid, ISSS/CODES conf., Oct 2003.
Dynamic Hardware/Software Partitioning: A First
Approach. G. Stitt, R. Lysecky and F. Vahid,
Design Automation Conf. (DAC), 2003.
On-Chip Logic Minimization, R. Lysecky and F.
Vahid, Design Automation Conf. (DAC), 2003.
The Energy Advantages of Microprocessor Platforms
with On-Chip Configurable Logic, G. Stitt and F.
Vahid, IEEE Design and Test of Computers,
Nov./Dec. 2002.
Hardware/Software Partitioning of Software
Binaries, G. Stitt and F. Vahid, IEEE/ACM
International Conference on Computer Aided Design
(ICCAD), Nov. 2002.
Related
A Self-Tuning Cache Architecture for Embedded
Systems. C. Zhang, F. Vahid and R. Lysecky. ACM
Transactions on Embedded Computing Systems
(TECS), Vol. 3., Issue 2, May 2004.
Fast Configurable-Cache Tuning with a Unified
Second-Level Cache. A. Gordon-Ross, F. Vahid, N.
Dutt. Int. Symp. on Low-Power Electronics and
Design (ISLPED), 2005.

16
Task Description

Warp processing background
Two seed SRC CSR grants (2002-2005) showed
feasibility
Idea: Transparently move critical binary regions from microprocessor to FPGA → 10x perf./energy gains or more
Task: Mature warp technology
Year 1 (in progress)
Automatic high-level construct recovery from
binaries
In-depth case studies (with Freescale)
Also discovered unanticipated problem, developed
solution
Warp-tailored FPGA prototype (with Intel)
Years 2/3
Reduce memory bottleneck by using smart buffer
Investigate domain-specific-FPGA concepts (with
Freescale)
Consider desktop/server domains (with IBM)

17
Automatic High-Level Construct Recovery from Binaries

Challenge: Binary lacks high-level constructs (loops, arrays, ...)
Decompilation can help recover
Extensive previous work (e.g., Cifuentes 93, 94, 99)

Original C Code
long f( short a[10] ) {
  long accum = 0;
  for (int i=0; i < 10; i++) {
    accum += a[i];
  }
  return accum;
}
Corresponding Assembly
Mov reg3, 0
Mov reg4, 0
loop:
Shl reg1, reg3, 1
Add reg5, reg2, reg1
Ld reg6, 0(reg5)
Add reg4, reg4, reg6
Add reg3, reg3, 1
Beq reg3, 10, -5
Ret reg4
18
New Method: Loop Rerolling

Problem: Compiler unrolling of loops (to expose parallelism) causes synthesis problems
Huge input (slow), can't unroll to desired amount, can't use advanced loop methods (loop pipelining, fusion, splitting, ...)
Solution: New decompilation method, Loop Rerolling
Identify unrolled iterations, compact into one iteration

Loop Unrolling
Ld reg2, 100(0)
Add reg1, reg1, reg2
Ld reg2, 100(1)
Add reg1, reg1, reg2
Ld reg2, 100(2)
Add reg1, reg1, reg2
for (int i=0; i < 3; i++) accum += a[i];
19
Loop Rerolling: Identify Unrolled Iterations

Find consecutively repeating instruction sequences

Original C Code
x = x + 1;
for (i=0; i < 2; i++)
  a[i] = b[i] + 1;
y = x;
20
Loop Rerolling: Compacting Iterations
Unrolled Loop Identification
Add r3, r3, 1
Ld r0, b(0)
Add r1, r0, 1
St a(0), r1
Ld r0, b(1)
Add r1, r0, 1
St a(1), r1
Mov r4, r3
Original C Code
x = x + 1;
for (i=0; i < 2; i++)
  a[i] = b[i] + 1;
y = x;
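
The identification step, finding consecutively repeating instruction sequences, can be sketched as a brute-force search over an opcode stream. This is an illustrative assumption about how such a search might look, not the method from the cited work: the names `reroll_t` and `find_unrolled_region` are hypothetical, and the sketch compares opcodes only. A real reroller would also have to confirm that the operands of matching instructions differ by a constant stride, as `b(0)` and `b(1)` do on this slide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode encoding for the example. */
enum opcode { OP_LD, OP_ADD, OP_ST, OP_MOV };

typedef struct {
    size_t start;    /* index of the first instruction of the repeated region */
    size_t period;   /* number of instructions in one iteration */
    size_t repeats;  /* number of back-to-back copies; 0 means none found */
} reroll_t;

/* Find the longest run made of at least two back-to-back copies of the
   same opcode sequence. Ties go to the smaller period, since periods
   are tried in increasing order and only strictly longer runs win. */
reroll_t find_unrolled_region(const enum opcode *ops, size_t n) {
    reroll_t best = {0, 0, 0};
    size_t best_len = 0;
    assert(ops != NULL || n == 0);
    for (size_t p = 1; p * 2 <= n; p++) {
        for (size_t s = 0; s + 2 * p <= n; s++) {
            /* Extend the match while each opcode equals the one p earlier. */
            size_t i = s + p;
            while (i < n && ops[i] == ops[i - p]) i++;
            size_t repeats = (i - s) / p;
            size_t covered = repeats * p;
            if (repeats >= 2 && covered > best_len) {
                best_len = covered;
                best.start = s;
                best.period = p;
                best.repeats = repeats;
            }
        }
    }
    return best;
}
```

Applied to the eight-instruction sequence on this slide (Add, Ld, Add, St, Ld, Add, St, Mov), the search reports a region starting at index 1 with a period of 3 instructions repeated twice, which corresponds to the two unrolled iterations of the loop body.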
21
Method: Strength Promotion

Problem: Compiler's strength reduction (replacing multiplies by shifts and adds) prevents synthesis from using hard-core multipliers, sometimes hurting circuit performance

[Figure: FIR Filter; Strength-reduced multiplication; Strength-Reduced FIR Filter]
22
Strength Promotion

Solution: Promote strength-reduced code to multiplies
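
As a small illustration of what promotion has to recover, the sketch below uses a multiply by 10 as an assumed example (the slides do not give the FIR coefficients). A compiler may emit the product as two shifts and an add; promotion recognizes that a sum of shifted copies of the same value is a multiply by a constant. The function names are hypothetical, and unsigned arithmetic is used so the shifts are well defined.

```c
#include <assert.h>
#include <stdint.h>

/* What a compiler might emit after strength reduction:
   x * 10 computed as 8x + 2x using shifts and one add. */
uint32_t times10_reduced(uint32_t x) {
    return (x << 3) + (x << 1);
}

/* Recover the constant multiplier from a sum of two shifted copies of x:
   (x << s1) + (x << s2) == x * (2^s1 + 2^s2), modulo 2^32. */
uint32_t promoted_constant(unsigned s1, unsigned s2) {
    assert(s1 < 32 && s2 < 32);
    return (UINT32_C(1) << s1) + (UINT32_C(1) << s2);
}

/* The promoted form: one multiply, which synthesis could map to a
   hard-core multiplier instead of a chain of shifters and adders. */
uint32_t times10_promoted(uint32_t x) {
    return x * promoted_constant(3, 1);
}
```

Both forms compute the same value for every 32-bit input, so the choice between them is purely about which circuit the synthesis tool can build.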

23
New Decompilation Methods Benefits
Speedups from Loop Rerolling

Rerolling
Speedups from better use of smart buffers
Other potential benefits faster synthesis, less
area
Strength promotion
Speedups from fewer cycles
Speedups from faster clock
New methods to be developed
e.g., pointer DS to arrays

24
Decompilation is Effective Even with High
Compiler-Optimization Levels
Average Speedup of 10 Examples
Publication: New Decompilation Techniques for
Binary-level Co-processor Generation. G. Stitt,
F. Vahid. IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), Nov. 2005.
25
Task Description

Warp processing background
Two seed SRC CSR grants (2002-2005) showed
feasibility
Idea: Transparently move critical binary regions from microprocessor to FPGA → 10x perf./energy gains or more
Task: Mature warp technology
Year 1 (in progress)
Automatic high-level construct recovery from
binaries
In-depth case studies (with Freescale)
Also discovered unanticipated problem, developed
solution
Warp-tailored FPGA prototype (with Intel)
Years 2/3
Reduce memory bottleneck by using smart buffer
Investigate domain-specific-FPGA concepts (with
Freescale)
Consider desktop/server domains (with IBM)

26
Research Problem: Make Synthesis from Binaries Competitive with Synthesis from High-Level Languages

Performed in-depth study with Freescale
H.264 video decoder
Highly-optimized proprietary code, not reference code
Huge difference
A benefit of SRC collaboration
Research question: Is synthesis from binaries competitive on highly-optimized code?
Several-month study

MPEG 2
H.264: Better quality, or smaller files, using more computation
27
Optimized H.264

Larger than most benchmarks
H.264: 16,000 lines
Previous work: 100 to several thousand lines
Highly-optimized
H.264: Many man-hours of manual optimization
10x faster than reference code used in previous works
Different profiling results
Previous examples
90% of time in several loops
H.264
90% of time in 45 functions
Harder to speed up

28
C vs. Binary Synthesis on Opt. H.264

Binary partitioning competitive with source
partitioning
Speedups compared to ARM9 software
Binary: 2.48, C: 2.53
Decompilation recovered nearly all high-level information needed for partitioning and synthesis
Discovered another research problem: Why aren't speedups (from binary or C) closer to "ideal" (0-time per fct)?

29
Coding Guidelines

Are there C-coding guidelines to improve
partitioning speedups?
Orthogonal to C vs. binary question
Guidelines may help both
Examined H.264 code further
Several phone conferences with Freescale liaisons, also several email exchanges and reports

Competitive, but both could be better
Coding guidelines get closer to ideal
30
Synthesis-Oriented Coding Guidelines

Pass by value-return
Declare a local array and copy in all data needed
by a function (makes lack of aliases explicit)
Function specialization
Create function version having frequent
parameter-values as constants

Original
void f(int width, int height) {
  . . . .
  for (i=0; i < width; i++)
    for (j=0; j < height; j++)
      . . . . . .
}
Rewritten
void f_4_4() {
  . . . .
  for (i=0; i < 4; i++)
    for (j=0; j < 4; j++)
      . . . . . .
}
Bounds are explicit so loops are now unrollable
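
The first guideline on this slide, pass by value-return, has no example in the deck, so here is a minimal sketch of how it might look. The function names, the block size `BLOCK_LEN`, and the scaling operation are assumptions chosen for illustration. The point is that the rewritten version computes only on local arrays, so a synthesis tool does not need alias analysis to see that loop iterations are independent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_LEN 8  /* assumed fixed block size for the example */

/* Original style: in and out may alias, so a tool must assume each
   store to out could change a later load from in. */
void scale_aliasing(const int *in, int *out, int k) {
    for (int i = 0; i < BLOCK_LEN; i++)
        out[i] = in[i] * k;
}

/* Pass by value-return: copy the inputs into a local array, compute on
   local data only, then copy the results back out. */
void scale_value_return(const int *in, int *out, int k) {
    int local_in[BLOCK_LEN];
    int local_out[BLOCK_LEN];
    assert(in != NULL && out != NULL);
    memcpy(local_in, in, sizeof local_in);
    for (int i = 0; i < BLOCK_LEN; i++)
        local_out[i] = local_in[i] * k;
    memcpy(out, local_out, sizeof local_out);
}
```

The extra copies cost time in software, which is consistent with the deck's observation that guideline-following code can run slightly slower on the microprocessor while synthesizing to a faster circuit.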
31
Synthesis-Oriented Coding Guidelines

Algorithmic specialization
Use parallelizable hardware algorithms when
possible
Hoisting and sinking of error checking
Keep error checking out of loops to enable
unrolling
Lookup table avoidance
Use expressions rather than lookup tables

Original
int clip[512] = { . . . };
void f() {
  . . .
  for (i=0; i < 10; i++)
    val[i] = clip[val[i]];
  . . .
}
Rewritten
void f() {
  . . .
  for (i=0; i < 10; i++)
    if (val[i] > 255) val[i] = 255;
    else if (val[i] < 0) val[i] = 0;
  . . .
}
Comparisons can now be parallelized
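
The slide illustrates lookup table avoidance only; the hoisting and sinking guideline can be sketched in the same spirit. The example below is an assumption about what such a rewrite might look like: the function names, the vector length `VEC_LEN`, and the "negative value is an error" rule are all hypothetical. The original form has an early exit inside the loop, while the rewritten form accumulates an error flag and tests it once after the loop, leaving a loop body with a fixed trip count that can be unrolled.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_LEN 10  /* assumed fixed trip count for the example */

/* Original style: the error check inside the loop creates an early
   exit, so the trip count is not fixed. */
int sum_check_inside(const int *v, int *sum) {
    int acc = 0;
    assert(v != NULL && sum != NULL);
    for (int i = 0; i < VEC_LEN; i++) {
        if (v[i] < 0) return -1;
        acc += v[i];
    }
    *sum = acc;
    return 0;
}

/* Rewritten: the check is sunk below the loop. The loop always runs
   VEC_LEN iterations and only records whether an error was seen. */
int sum_check_sunk(const int *v, int *sum) {
    int acc = 0;
    int bad = 0;
    assert(v != NULL && sum != NULL);
    for (int i = 0; i < VEC_LEN; i++) {
        bad |= (v[i] < 0);
        acc += v[i];
    }
    if (bad) return -1;
    *sum = acc;
    return 0;
}
```

Both versions return the same result and leave `*sum` untouched on error; the rewritten one simply does some wasted work when the input is invalid.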
32
Synthesis-Oriented Coding Guidelines

Use explicit control flow
Replace function pointers with if statements and
static function calls

Original
void (*funcArray[]) (char *data) = { func1, func2, . . . };
void f(char *data) {
  . . .
  funcPointer = funcArray[i];
  (*funcPointer) (data);
  . . .
}
Rewritten
void f(char *data) {
  . . .
  if (i == 0)
    func1(data);
  else if (i == 1)
    func2(data);
  . . .
}
33
Coding Guideline Results on H.264

Simple coding guidelines made large improvement
Rewritten software only 3% slower than original
And, binary partitioning still competitive with C partitioning
Speedups: Binary 6.55, C 6.56
Small difference caused by switch statements that used indirect jumps

34
Studied More Benchmarks, Developed More Guidelines

Studied guidelines further on standard benchmarks
Further synthesis speedups (again, independent of
C vs. binary issue)
Publications
Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode. G. Stitt, F. Vahid, G. McGregor, B. Einloth. Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), 2005 (joint publication with Freescale)
Submitted: A Code Refinement Methodology for Performance-Improved Synthesis from C. G. Stitt, F. Vahid, W. Najjar, 2006.
More guidelines to be developed

35
Task Description

Warp processing background
Two seed SRC CSR grants (2002-2005) showed
feasibility
Idea: Transparently move critical binary regions from microprocessor to FPGA → 10x perf./energy gains or more
Task: Mature warp technology
Year 1 (in progress)
Automatic high-level construct recovery from
binaries
In-depth case studies (with Freescale)
Also discovered unanticipated problem, developed
solution
Warp-tailored FPGA prototype (with Intel)
Years 2/3
Reduce memory bottleneck by using smart buffer
Investigate domain-specific-FPGA concepts (with
Freescale)
Consider desktop/server domains (with IBM)

36
Warp-Tailored FPGA Prototype

Developed FPGA fabric tailored to
fast/small-memory on-chip CAD
Building chip prototype with Intel
Created synthesizable VHDL models, running
through Intel shuttle tool flow
Plan to incorporate with ARM processor and other
IP on shuttle seat
Bi-weekly phone meetings with Intel engineers
since summer 2005, ongoing, scheduled tapeout
2006 Q3

37
Industrial Interactions

Freescale
Numerous phone conferences, emails, and reports,
on technical subjects
Co-authored paper (CODES/ISSS'05), another pending
Summer internship: Scott Sirowy (new UCR graduate student), summer 2005, Austin
Intel
Three visits by PI, one by graduate student Roman
Lysecky, to Intel Research in Santa Clara
PI presented at Intel System Design Symposium,
Nov. 2005
PI served on Intel Research Silicon Prototyping
Workshop panel, May 2005
Participating in Intel's Research Shuttle (chip prototype), bi-weekly phone conferences since summer 2005 involving PI, Intel engineers, and Roman Lysecky (now Prof. at UA)
IBM
Embarking on studies of warp processing results
on server applications
UCR group to receive Cell-based prototyping
platform (w/ Prof. Walid Najjar)
Several interactions with Xilinx also

38
Task Description: Coming Up

Warp processing background
Two seed SRC CSR grants (2002-2005) showed
feasibility
Idea: Transparently move critical binary regions from microprocessor to FPGA → 10x perf./energy gains or more
Task: Mature warp technology
Years 1/2 (in progress)
Automatic high-level construct recovery from
binaries
In-depth case studies (with Freescale)
Also discovered unanticipated problem, developed
solution
Warp-tailored FPGA prototype (with Intel)
Years 2/3: All three sub-tasks just now underway
Reduce memory bottleneck by using smart buffer
Investigate domain-specific-FPGA concepts (with
Freescale)
Consider desktop/server domains (with IBM)

39
Recent Publications

New Decompilation Techniques for Binary-level
Co-processor Generation. G. Stitt, F. Vahid.
IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), 2005.
Fast Configurable-Cache Tuning with a Unified
Second-Level Cache. A. Gordon-Ross, F. Vahid, N.
Dutt. Int. Symp. on Low-Power Electronics and
Design (ISLPED), 2005.
Hardware/Software Partitioning of Software
Binaries: A Case Study of H.264 Decode. G. Stitt,
F. Vahid, G. McGregor, B. Einloth. International
Conference on Hardware/Software Codesign and
System Synthesis (CODES/ISSS), 2005. (Co-authored
paper with Freescale)
Frequent Loop Detection Using Efficient
Non-Intrusive On-Chip Hardware. A. Gordon-Ross
and F. Vahid. IEEE Trans. on Computers, Special
Issue- Best of Embedded Systems,
Microarchitecture, and Compilation Techniques in
Memory of B. Ramakrishna (Bob) Rau, Oct. 2005.
A Study of the Scalability of On-Chip Routing for
Just-in-Time FPGA Compilation. R. Lysecky, F.
Vahid and S. Tan. IEEE Symposium on
Field-Programmable Custom Computing Machines
(FCCM), 2005.
A First Look at the Interplay of Code Reordering
and Configurable Caches. A. Gordon-Ross, F.
Vahid, N. Dutt. Great Lakes Symposium on VLSI
(GLSVLSI), April 2005.
A Study of the Speedups and Competitiveness of
FPGA Soft Processor Cores using Dynamic
Hardware/Software Partitioning. R. Lysecky and F.
Vahid. Design Automation and Test in Europe
(DATE), March 2005.
A Decompilation Approach to Partitioning Software
for Microprocessor/FPGA Platforms. G. Stitt and
F. Vahid. Design Automation and Test in Europe
(DATE), March 2005.