Title: Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research
1. Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research
- Frank Vahid
- Professor
- Department of Computer Science and Engineering
- University of California, Riverside
- Associate Director, Center for Embedded Computer Systems, UC Irvine
- Work supported by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx
- Collaborators: David Sheldon (4th-yr UCR PhD student), Roman Lysecky (PhD UCR 2005, now Asst. Prof. at U. Arizona), Rakesh Kumar (PhD UCSD 2006, now Asst. Prof. at UIUC), Dean Tullsen (Prof. at UCSD)
2. Outline
- Two UCR ICCAD'06 papers
- Microblaze customization
- Microblaze conjoining (and customization)
- Current work targeting Microblaze users
- Design of Experiments paradigm
- System-level synthesis for multi-core systems
- Related FPGA work
- "Warp processing"
- Standard binaries for FPGAs
3. Microblaze Customization (ICCAD paper 1)
- FPGAs are an increasingly popular software platform
- FPGA soft core processor
- Microprocessor synthesized onto FPGA fabric
- Soft core customization
- Cores come with configurable parameters
- Xilinx Microblaze comes with several instantiatable units: multiplier, barrel shifter, divider, FPU, or cache
- Customization: tuning soft core parameters to a specific application
[Figure: three Microblaze configurations, each instantiating a different subset of units (Mul, BS, Div, FPU, I-cache)]
4. Instantiable Unit Speedups
- Instantiating units can yield significant speedups
- "base" = Microblaze without any optional units instantiated
5. Customization Tradeoffs
Data for the aifir EEMBC benchmark on a Xilinx Microblaze synthesized to a Virtex device
6. Size on an FPGA
Image courtesy of Xilinx
- Defining a circuit's size on an FPGA requires some work
- Different resources:
- Lookup tables (LUTs)
- Embedded multipliers
- Embedded block RAM (BRAM)
- Our solution: define equivalent LUTs for multipliers and BRAM
- Based on total LUTs, multipliers, and BRAMs in a full Microblaze
- Later found to closely match Xilinx's "equivalent gates" concept
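The equivalent-LUT metric can be sketched as a simple weighted sum. The conversion weights below are hypothetical placeholders, not the paper's values; the paper calibrates its weights against the resource totals of a full Microblaze synthesis.

```python
def equivalent_luts(luts, mults, brams, luts_per_mult=64, luts_per_bram=128):
    """Collapse heterogeneous FPGA resources into a single size number.

    luts_per_mult and luts_per_bram are HYPOTHETICAL conversion weights;
    the paper derives its weights from the LUT, multiplier, and BRAM
    totals of a full Microblaze synthesis.
    """
    return luts + mults * luts_per_mult + brams * luts_per_bram

# Example: a configuration using 1200 LUTs, 3 multipliers, and 2 BRAMs
print(equivalent_luts(1200, 3, 2))  # -> 1648
```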
7. Goal: Customize Soft Core to Minimize Application Runtime
- With and without size constraint
- Even without size constraint, must take care
because some units reduce clock frequency and
thus may slow runtime
8. Goal: Customize Soft Core to Minimize Application Runtime
- With and without size constraint
- Even without size constraint, must take care because some units reduce clock frequency and thus may slow runtime
- "Full MB" = MB with all units instantiated
9. Key Problem Related to Core Customization
- Problem: synthesis of one soft core configuration, and execution/simulation on that configuration, requires about 15 minutes
- Thus, for reasonable customization tool runtimes, can only synthesize 5-10 configurations in search of the best one
10. Two Solution Approaches
- Traditional CAD approach
- Pre-characterize using synthesis and execution/simulation, create an abstract problem model, solve using CAD exploration algorithms
- Used a 0-1 knapsack formulation
- Synthesis-in-the-loop approach
- Run synthesis and execute/simulate the application while exploring
- More accurate
[Figure: traditional CAD flow — start → synthesis and execution/simulation (5-10 executions) → pre-characterize → model (typically some form of graph) → explore → finish; synthesis-in-the-loop flow — start → explore with synthesis and execution/simulation (5-10 iterations) → finish]
11. Traditional CAD Approach
- Map to the 0-1 knapsack problem
- 0-1 knapsack problem:
- Given a set of items, each with a value and a weight
- Maximize the value of items in a weight-constrained knapsack
- Mapping:
- Item: instantiatable unit
- Value of an item: speedup increment when instantiating the unit, vs. base MB
- Weight of an item: equivalent LUTs
- Knapsack weight constraint: equivalent-LUTs constraint
[Figure: items (Mul, BS, Div, FPU, I-cache) placed into a size-constrained knapsack]
12. Traditional CAD Approach
- A unit's size is determined by synthesizing the MB with only that unit instantiated
- Requires 5 synthesis runs, 1 for base; about 1 hour
- A unit's speedup increment is determined by comparing application runtime with and without the unit; different for every application
- A well-known knapsack heuristic uses the value/weight ratio and applies dynamic programming; complexity is O(nW)
- n = number of items, W = knapsack constraint
- Runtime is negligible (seconds)
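As a concrete sketch of this formulation, the O(nW) dynamic program can be written as follows; the unit values and sizes below are hypothetical, not measured data from the paper.

```python
def knapsack_01(items, capacity):
    """0-1 knapsack via dynamic programming, O(n*W) as noted above.

    items: list of (name, value, weight), where value is a unit's speedup
    increment and weight its size in equivalent LUTs.
    Returns (best_value, chosen_names).
    """
    n = len(items)
    # dp[i][w] = best value achievable with the first i items and weight <= w
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, value, weight) in enumerate(items, 1):
        for w in range(capacity + 1):
            dp[i][w] = dp[i - 1][w]
            if weight <= w and dp[i - 1][w - weight] + value > dp[i][w]:
                dp[i][w] = dp[i - 1][w - weight] + value
    # Backtrack to recover which units were chosen
    chosen, w = [], capacity
    for i in range(n, 0, -1):
        if dp[i][w] != dp[i - 1][w]:
            name, _, weight = items[i - 1]
            chosen.append(name)
            w -= weight
    return dp[n][capacity], chosen[::-1]

# HYPOTHETICAL units: (name, speedup increment, equivalent LUTs)
units = [("mul", 0.4, 300), ("bs", 0.3, 200), ("div", 0.2, 400),
         ("fpu", 0.5, 900), ("cache", 0.6, 700)]
print(knapsack_01(units, 1500))  # best subset under a 1500-LUT budget
```

Note this ignores the unit interactions discussed on the next slide; that is exactly the weakness of the pure knapsack mapping.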
13. Problem with Traditional CAD Approach
- Does not consider interactions among units
- Speedup increments may not be additive for a given application
- e.g., Mul → 0.4, BS → 0.3, but Mul+BS → 0.6, not 0.7
- Thus, not a perfect mapping to the 0-1 knapsack problem, because item values don't add perfectly
Average pairwise speedup-increment additivity inaccuracies for all pairs of benchmarks
14. Synthesis-in-the-Loop Approach
- View the solution space as a tree
- Level: instantiate a given unit?
- If it gives speedup and fits, instantiate
- Use unit speedup/size to order the tree
- Impact-ordered tree approach
- 5 synthesis runs, 1 for base, to determine each unit's speedup/size
- Then requires 5 more synthesis runs to descend through the tree
[Figure: decision tree — one level per unit (barrel shifter, multiplier, cache, divider, FPU), each node branching Yes/No on instantiating that unit (base, base+bs, base+mul, base+bs+mul, ...)]
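The descent can be sketched as a loop over an impact-ordered unit list. `synthesize_and_run` and `size_of` below are hypothetical stand-ins for the real ~15-minute synthesis-plus-execution step, with made-up per-unit runtime factors and sizes just to make the loop runnable.

```python
def synthesize_and_run(config, base_runtime=100.0):
    """HYPOTHETICAL cost model standing in for real synthesis + execution:
    each instantiated unit scales application runtime by a fixed factor."""
    factors = {"bs": 0.75, "mul": 0.70, "cache": 0.85, "div": 0.95, "fpu": 0.99}
    runtime = base_runtime
    for unit in config:
        runtime *= factors[unit]
    return runtime

def size_of(config):
    """HYPOTHETICAL equivalent-LUT sizes per unit."""
    sizes = {"bs": 200, "mul": 300, "cache": 700, "div": 400, "fpu": 900}
    return sum(sizes[u] for u in config)

def descend(ordered_units, size_limit):
    """Walk one root-to-leaf path of the impact-ordered tree: at each level,
    keep the unit only if it fits the size budget and improves measured
    runtime (one synthesis run per level)."""
    config = frozenset()
    best = synthesize_and_run(config)
    for unit in ordered_units:        # ordered by speedup/size impact
        trial = config | {unit}
        if size_of(trial) > size_limit:
            continue                  # doesn't fit: take the "No" branch
        t = synthesize_and_run(trial)
        if t < best:                  # helps: take the "Yes" branch
            config, best = trial, t
    return config, best

ordered = ["mul", "bs", "cache", "div", "fpu"]   # hypothetical impact order
config, runtime = descend(ordered, size_limit=1200)
print(sorted(config), runtime)
```

With the 5 characterization runs that produce the ordering, this matches the slide's 5 + 5 + base = 11 synthesis runs.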
15. Synthesis-in-the-Loop Approach
- View the solution space as a tree, each level a decision for a unit; order levels by unit speedup/size for the application
- 11 synthesis runs may take a few hours
- To reduce, can consider using a pre-determined order
- Determined by the soft core vendor based on averages over many benchmarks
[Figure: application-specific impact-ordered tree, with Yes/No branches (base, base+div, ...)]
16. Synthesis-in-the-Loop Approach
- Data for a fixed impact-ordered tree for 11 EEMBC benchmarks
17. Customization Results
- Fixed-tree approach generally best
- App-specific tree better for certain apps, but 2x tool runtime
- ICCAD'06, David Sheldon et al.
Results are averages for 11 EEMBC benchmarks
18. Conjoined Processors (ICCAD paper 2)
[Figure: two processors sharing a single multiplier]
- Conjoined processors
- Two processors sharing a hardware unit to save size (Kumar/Jouppi/Tullsen, ISCA 2004)
- Showed little performance overhead for desktop processors
- There, the only research customer is Intel; for soft core processors, the research customers are every soft core user
- How much size savings and performance overhead for conjoined Microblazes?
19. Conjoined Processors: Size Savings
20. Conjoined Processors: Performance Overhead
- We created a trace simulator
- Reads two instruction traces output by the MB simulator
- Adds a 1-cycle delay for every access to a conjoined unit (pessimistic assumption about the contention detection scheme)
- Looks for simultaneous accesses of the shared unit; stalls one MB entirely until the unit becomes available
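A minimal sketch of such a trace simulator, assuming a simplified trace format (a list of opcodes, one per cycle) and a fixed arbitration policy in which processor A always wins a simultaneous request; the real tool reads actual MB simulator traces.

```python
def simulate(trace_a, trace_b, shared_op="bs"):
    """Return total wall-clock cycles to drain both traces.

    Normal ops take 1 cycle. A shared-unit op takes 2 cycles (the extra
    cycle models the pessimistic contention-detection delay). If both
    processors request the unit in the same cycle, processor B stalls
    entirely until the unit becomes available.
    """
    traces = [list(trace_a), list(trace_b)]
    pc = [0, 0]          # next instruction index per processor
    stall = [0, 0]       # remaining stall cycles per processor
    cycles = 0
    while (pc[0] < len(traces[0]) or pc[1] < len(traces[1])
           or stall[0] or stall[1]):
        wants = [pc[p] < len(traces[p]) and stall[p] == 0
                 and traces[p][pc[p]] == shared_op for p in (0, 1)]
        if wants[0] and wants[1]:
            stall[1] = 2             # B loses arbitration; waits for the unit
        for p in (0, 1):
            if stall[p] > 0:
                stall[p] -= 1
            elif pc[p] < len(traces[p]):
                if traces[p][pc[p]] == shared_op:
                    stall[p] = 1     # extra cycle for contention detection
                pc[p] += 1
        cycles += 1
    return cycles

# Both processors hit the shared barrel shifter in the same cycle:
print(simulate(["bs", "add"], ["bs", "add"]))   # -> 5
```

Comparing against the no-conjoining baseline (every op takes 1 cycle) gives the per-pair performance overhead plotted on the following slides.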
21. Conjoined Processors: Performance Overhead
- Data shown for benchmarks that benefit (>1.3x speedup) from the barrel shifter
- Performance overheads are small
22. Performance Overhead for All Benchmark Pairs
23. Customization Considering Conjoinment
- Developed a 0-1 knapsack approach
- Disjunctively-constrained knapsack solution to accommodate conjoinment
ICCAD'06, David Sheldon et al.
Only 8 pairings shown due to space limits
Note: to avoid exaggerating the benefits of conjoinment, the data only considers benchmark pairs that significantly use a shared unit
24. Outline
- Two UCR ICCAD'06 papers
- Microblaze customization
- Microblaze conjoining (and customization)
- Current work targeting Microblaze users
- Design of Experiments paradigm
- System-level synthesis for multi-core systems
- Related FPGA work
- "Warp processing"
- Standard binaries for FPGAs
25. Ongoing Work: Design of Experiments Paradigm
- "Design of Experiments"
- Well-established discipline (>80 yrs) for tuning parameters
- For factories, crops, management, etc.
- Want to set parameter values for best output
- But each experiment is costly, so can't try all combinations
- Clear mapping of soft core customization to the DOE problem
- Given parameters and the number of possible experiments
- Generates which experiments to run (parameter values)
- Analyzes the resulting data
- Sound mathematical foundations
- Present focus of David Sheldon (4th-yr Ph.D.)
26. Ongoing Work: Design of Experiments Paradigm
- Suppose there is time for 12 experiments
- The DOE tool generates which 12 experiments to run
- The user fills in the results column
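One classic DOE construction that fits this setting is a two-level half-fraction, which covers k on/off parameters in 2^(k-1) runs instead of 2^k. The factor names below are hypothetical Microblaze parameters, and real DOE tools generate richer designs and analyze the results statistically; this only illustrates how a tool can pick which experiments to run.

```python
from itertools import product

factors = ["mul", "bs", "div", "cache"]   # each at two levels: -1 (off), +1 (on)

def half_fraction(k):
    """2^(k-1) fractional-factorial design: enumerate the full factorial
    over the first k-1 factors, then set the last factor to the product
    of the others (defining relation I = ABCD), halving the run count."""
    runs = []
    for levels in product((-1, 1), repeat=k - 1):
        last = 1
        for v in levels:
            last *= v
        runs.append(levels + (last,))
    return runs

design = half_fraction(len(factors))
print(len(design))        # 8 runs instead of 2**4 = 16
for run in design:
    print(dict(zip(factors, run)))
```

The user would then synthesize/run only those 8 configurations and fit main effects from the results.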
27. Ongoing Work: Design of Experiments Paradigm
- The DOE tool analyzes the results
- Finds the most important factors for a given application
28. Ongoing Work: Design of Experiments Paradigm
- Results for a different application
29. Ongoing Work: Design of Experiments Paradigm
- Interactions among parameters also automatically
determined
30. Ongoing Work: System Synthesis
- Given N applications
- Create a customized soft core for each app
- Criteria: meet the size constraint, minimize total applications' runtime
- Other criteria possible (e.g., meet a runtime constraint, minimize size)
- Present focus of Ryan Mannion (3rd-yr Ph.D.)
[Figure: N applications (App1, App2, ..., AppN), each mapped to its own customized core]
31. Ongoing Work: System Synthesis
Graduate student: Ryan Mannion, 3rd-yr Ph.D.
- Presently use an Integer Linear Program
- Solutions for a large set of Xilinx devices generated in seconds
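The selection problem can be sketched as follows: pick one pre-characterized core configuration per application so that total equivalent LUTs fit the device and total runtime is minimized. The real tool phrases this as an Integer Linear Program; this exhaustive version, with hypothetical runtimes and sizes, is equivalent only for tiny instances.

```python
from itertools import product

# HYPOTHETICAL per-app candidates: (config name, runtime, equivalent LUTs)
candidates = {
    "app1": [("base", 100, 1000), ("base+mul", 70, 1300)],
    "app2": [("base", 200, 1000), ("base+bs+mul", 120, 1500)],
}

def system_synthesis(candidates, size_limit):
    """Exhaustively try one configuration per app; keep the assignment
    that fits the device size limit with the smallest total runtime."""
    apps = sorted(candidates)
    best = None
    for choice in product(*(candidates[a] for a in apps)):
        size = sum(c[2] for c in choice)
        runtime = sum(c[1] for c in choice)
        if size <= size_limit and (best is None or runtime < best[0]):
            best = (runtime, dict(zip(apps, (c[0] for c in choice))))
    return best

print(system_synthesis(candidates, 2500))
```

An ILP encodes the same thing with one 0-1 variable per (app, configuration) pair, a one-config-per-app constraint, and the size bound; off-the-shelf solvers handle realistic instance sizes in seconds, as noted above.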
32. Outline
- Two UCR ICCAD'06 papers
- Microblaze customization
- Microblaze conjoining (and customization)
- Current work targeting Microblaze users
- Design of Experiments paradigm
- System-level synthesis for multi-core systems
- Related FPGA work
- "Warp processing"
- Standard binaries for FPGAs
33. Binary-Level Synthesis
- Binary-level FPGA compiler developed 2002-2006 (Greg Stitt, Ph.D. UCR 2007)
- A source-level FPGA compiler provides a limited solution
- A binary-level FPGA compiler provides a more general solution, at the expense of lost high-level information
[Figure: source-level flow (C/Java/asm → compilers/assembler → obj → linker → microprocessor binary and FPGA binary) vs. binary-level flow (microprocessor binary → binary-level FPGA compiler → FPGA binary)]
34. Binary Synthesis Competitive with Source Level
- Aggressive decompilation recovers most high-level constructs needed for good synthesis, making binary-level synthesis competitive with source level
Freescale H.264 decoder example, from ISSS/CODES 2005
35. Binary Synthesis Enables Dynamic Hardware/Software Partitioning
- Called "Warp Processing" (Vahid/Stitt/Lysecky, 2003-2007)
- Direct collaborators: Intel, IBM, and Freescale
36. Warp Processing Idea
1. Initially, the software binary is loaded into instruction memory
[Figure: µP with instruction memory, data cache, profiler, FPGA, and on-chip CAD]
37. Warp Processing Idea
2. Microprocessor executes instructions in the software binary
38. Warp Processing Idea
3. Profiler monitors instructions and detects critical regions in the binary
39. Warp Processing Idea
4. On-chip CAD reads in the critical region
40. Warp Processing Idea
5. On-chip CAD decompiles the critical region into a control/data flow graph (CDFG)
41. Warp Processing Idea
6. On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit
42. Warp Processing Idea
7. On-chip CAD maps the circuit onto the FPGA
43. Warp Processing Idea
8. On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop:  // instructions that interact with FPGA
Ret reg4
DAC'03, DAC'04, DATE'04, ISSS/CODES'04, FPGA'04, DATE'05, FCCM'05, ICCAD'05, ISSS/CODES'05, TECS'06, U.S. patent pending
44. Warp Processing Challenges
- Two key challenges
- Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? (G. Stitt)
- Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? (R. Lysecky)
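The eight steps above can be connected in one control loop. Every component below is a hypothetical stand-in (the real system profiles in hardware and runs lean CAD algorithms on-chip); the sketch only shows how the pieces hand off to each other.

```python
def decompile(region):          # step 5 stand-in: instructions -> CDFG
    return ("cdfg", region)

def synthesize(cdfg):           # step 6 stand-in: CDFG -> parallel circuit
    return ("circuit",) + cdfg

def place_and_route(circuit):   # step 7 stand-in: circuit -> FPGA config
    return ("bitstream",) + circuit

def patch_binary(binary, region, bitstream):
    # Step 8 stand-in: replace the software region with a hardware-invoking stub
    binary[binary.index(region)] = ("call_fpga", region, bitstream)

def warp_process(binary, profile, hot_threshold=0.5):
    """binary: mutable list of region names; profile: region -> runtime share
    (steps 1-3: execution and hardware profiling are assumed done)."""
    for region in list(binary):
        if isinstance(region, str) and profile.get(region, 0.0) >= hot_threshold:
            cdfg = decompile(region)                 # step 5
            circuit = synthesize(cdfg)               # step 6
            bitstream = place_and_route(circuit)     # step 7
            patch_binary(binary, region, bitstream)  # step 8

binary = ["init", "kernel_loop", "cleanup"]
warp_process(binary, {"kernel_loop": 0.9})
print(binary[1][0])  # -> call_fpga
```

The two research challenges above map directly onto `decompile` (recovering enough structure) and the whole loop's footprint (running it with on-chip resources).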
45. Warp Processors: Performance Speedup (Most Frequent Kernel Only)
Baseline: SW-only execution
46. Warp Processors: Performance Speedup (Overall, Multiple Kernels)
Assuming a 100 MHz ARM, and fabric clocked at the rate determined by synthesis
- Energy reduction of 38-94%
Baseline: SW-only execution
47. Warp Processors: Speedups Compared with a Digital Signal Processor
48. Warp Processors: Speedups for Multi-Threaded Application Benchmarks
- Compelling computing advantage of FPGAs
- Parallelism from the bit level up to the processor level, and everywhere in between
49. FPGA Ubiquity via Obscurity
- Warp processing hides the FPGA from languages and tools
- ANY microprocessor platform extendible with FPGA
- Maintains the "ecosystem" of application, tool, and architecture developers
- New processor platforms with FPGAs appearing
[Figure: standard compiler and profiling flow feeding new processor platforms with FPGA]
50. FPGA Standard Binaries?
- A microprocessor binary represents one form of a "standard binary for FPGAs"
- Missing is explicit concurrency
- Parallelism, pipelining, queues, etc.
- As FPGAs appear in more platforms, might a more general FPGA binary evolve?
[Figure: applications and tools producing standard binaries and standard FPGA binaries, via profiling and a standard compiler, for many architectures]
51. FPGA Standard Binaries?
- A translator would make best use of existing FPGA resources
- Could even add FPGA, like adding memory, to improve performance
- Add more FPGA to your PDA to implement a compute-intensive application?
52. FPGA Standard Binaries
- NSF funding received for 2006-2009
- Xilinx letter of support was helpful
Graduate student: Scott Sirowy, 2nd-year Ph.D.
53. Future Work: Standard Binary
[Figure: high-level behavior, via a desktop tool and/or human effort, compiled to either a temporally-oriented binary (e.g., Mul r1, r2, r3; Mul r4, r5, r6; Add r7, r1, r4) OR a spatially-oriented binary (FPGA bitstream)]
57. Conclusions
- Soft core customization increasingly important to make best use of limited FPGA resources
- Good initial automatic customization results
- Design of Experiments paradigm looks promising
- System-level synthesis may yield a very useful MB user tool, perhaps web based
- Warp processing and standard FPGA binary work can help make FPGAs ubiquitous
- Accomplishments made possible by Xilinx donations and interactions
- Continued and close collaboration sought