Title: Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research
1. Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research
- Frank Vahid
- Professor
- Department of Computer Science and Engineering
- University of California, Riverside
- Associate Director, Center for Embedded Computer Systems, UC Irvine
- Work supported by the National Science Foundation, the Semiconductor Research Corporation, and Xilinx
- Collaborators: David Sheldon (4th-yr UCR PhD student), Roman Lysecky (PhD UCR 2005, now Asst. Prof. at U. Arizona), Rakesh Kumar (PhD UCSD 2006, now Asst. Prof. at UIUC), Dean Tullsen (Prof. at UCSD)
2. Outline
- Two UCR ICCAD'06 papers
- Microblaze customization
- Microblaze conjoining (and customization)
- Current work targeting Microblaze users
- Design of Experiments paradigm
- System-level synthesis for multi-core systems
- Related FPGA work
- "Warp processing"
- Standard binaries for FPGAs
3. Microblaze Customization (ICCAD paper 1)
- FPGAs are an increasingly popular software platform
- FPGA soft core processor
- Microprocessor synthesized onto FPGA fabric
- Soft core customization
- Cores come with configurable parameters
- Xilinx Microblaze comes with several instantiatable units: multiplier, barrel shifter, divider, FPU, or cache
- Customization: tuning soft core parameters to a specific application
[Figure: three Microblaze configurations, each instantiating a different subset of units (Mul, BS, Div, FPU, I-cache)]
4. Instantiable Unit Speedups
- Instantiating units can yield significant speedups
- "base" = Microblaze without any optional units instantiated
5. Customization Tradeoffs
Data for the aifir EEMBC benchmark on a Xilinx Microblaze synthesized to a Virtex device
6. Size on an FPGA
Image courtesy of Xilinx
- Defining a circuit's size on an FPGA requires some work
- Different resources:
- Lookup tables (LUTs)
- Embedded multipliers
- Embedded block RAM (BRAM)
- Our solution: define equivalent LUTs for multipliers and BRAM
- Based on total LUTs, multipliers, and BRAMs in a full Microblaze
- Later found to closely match Xilinx's "equivalent gates" concept
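The equivalent-LUT metric can be sketched as a simple weighted sum. The conversion weights below are hypothetical placeholders, not the paper's values; the paper calibrates its weights against the resource totals of a full Microblaze synthesis.

```python
def equivalent_luts(luts, mults, brams, luts_per_mult=64, luts_per_bram=128):
    """Collapse heterogeneous FPGA resources into a single size number.

    luts_per_mult and luts_per_bram are HYPOTHETICAL conversion weights;
    the paper derives its weights from the LUT, multiplier, and BRAM
    totals of a full Microblaze synthesis.
    """
    return luts + mults * luts_per_mult + brams * luts_per_bram

# Example: a configuration using 1200 LUTs, 3 multipliers, and 2 BRAMs
print(equivalent_luts(1200, 3, 2))  # -> 1648
```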
7. Goal: Customize Soft Core to Minimize Application Runtime
- With and without size constraint
- Even without size constraint, must take care
because some units reduce clock frequency and
thus may slow runtime
8. Goal: Customize Soft Core to Minimize Application Runtime
- With and without size constraint
- Even without size constraint, must take care because some units reduce clock frequency and thus may slow runtime
- "Full MB" = MB with all units instantiated
9. Key Problem Related to Core Customization
- Problem: synthesis of one soft core configuration, and execution/simulation on that configuration, requires about 15 minutes
- Thus, for reasonable customization tool runtimes, can only synthesize 5-10 configurations in search of the best one
10. Two Solution Approaches
- Traditional CAD approach
- Pre-characterize using synthesis and execution/simulation, create an abstract problem model, solve using CAD exploration algorithms
- Used a 0-1 knapsack formulation
- Synthesis-in-the-loop approach
- Run synthesis and execute/simulate the application while exploring
- More accurate
[Figure: traditional CAD flow — start → synthesis and execution/simulation (5-10 executions) → pre-characterize → model (typically some form of graph) → explore → finish; synthesis-in-the-loop flow — start → explore with synthesis and execution/simulation (5-10 iterations) → finish]
11. Traditional CAD Approach
- Map to the 0-1 knapsack problem
- 0-1 knapsack problem:
- Given a set of items, each with a value and a weight
- Maximize the value of items in a weight-constrained knapsack
- Mapping:
- Item: instantiatable unit
- Value of an item: speedup increment when instantiating the unit, vs. base MB
- Weight of an item: equivalent LUTs
- Knapsack weight constraint: equivalent-LUTs constraint
[Figure: items (Mul, BS, Div, FPU, I-cache) placed into a size-constrained knapsack]
12. Traditional CAD Approach
- A unit's size is determined by synthesizing the MB with only that unit instantiated
- Requires 5 synthesis runs, 1 for base; about 1 hour
- A unit's speedup increment is determined by comparing application runtime with and without the unit; different for every application
- A well-known knapsack heuristic uses the value/weight ratio and applies dynamic programming; complexity is O(nW)
- n = number of items, W = knapsack constraint
- Runtime is negligible (seconds)
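As a concrete sketch of this formulation, the O(nW) dynamic program can be written as follows; the unit values and sizes below are hypothetical, not measured data from the paper.

```python
def knapsack_01(items, capacity):
    """0-1 knapsack via dynamic programming, O(n*W) as noted above.

    items: list of (name, value, weight), where value is a unit's speedup
    increment and weight its size in equivalent LUTs.
    Returns (best_value, chosen_names).
    """
    n = len(items)
    # dp[i][w] = best value achievable with the first i items and weight <= w
    dp = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, value, weight) in enumerate(items, 1):
        for w in range(capacity + 1):
            dp[i][w] = dp[i - 1][w]
            if weight <= w and dp[i - 1][w - weight] + value > dp[i][w]:
                dp[i][w] = dp[i - 1][w - weight] + value
    # Backtrack to recover which units were chosen
    chosen, w = [], capacity
    for i in range(n, 0, -1):
        if dp[i][w] != dp[i - 1][w]:
            name, _, weight = items[i - 1]
            chosen.append(name)
            w -= weight
    return dp[n][capacity], chosen[::-1]

# HYPOTHETICAL units: (name, speedup increment, equivalent LUTs)
units = [("mul", 0.4, 300), ("bs", 0.3, 200), ("div", 0.2, 400),
         ("fpu", 0.5, 900), ("cache", 0.6, 700)]
print(knapsack_01(units, 1500))  # best subset under a 1500-LUT budget
```

Note this ignores the unit interactions discussed on the next slide; that is exactly the weakness of the pure knapsack mapping.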
13. Problem with Traditional CAD Approach
- Does not consider interactions among units
- Speedup increments may not be additive for a given application
- e.g., Mul → 0.4, BS → 0.3, but Mul+BS → 0.6, not 0.7
- Thus, not a perfect mapping to the 0-1 knapsack problem, because item values don't add perfectly
Average pairwise speedup-increment additivity inaccuracies for all pairs of benchmarks
14. Synthesis-in-the-Loop Approach
- View the solution space as a tree
- Level: instantiate a given unit?
- If it gives speedup and fits, instantiate
- Use unit speedup/size to order the tree
- Impact-ordered tree approach
- 5 synthesis runs, 1 for base, to determine each unit's speedup/size
- Then requires 5 more synthesis runs to descend through the tree
[Figure: decision tree — one level per unit (barrel shifter, multiplier, cache, divider, FPU), each node branching Yes/No on instantiating that unit (base, base+bs, base+mul, base+bs+mul, ...)]
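The descent can be sketched as a loop over an impact-ordered unit list. `synthesize_and_run` and `size_of` below are hypothetical stand-ins for the real ~15-minute synthesis-plus-execution step, with made-up per-unit runtime factors and sizes just to make the loop runnable.

```python
def synthesize_and_run(config, base_runtime=100.0):
    """HYPOTHETICAL cost model standing in for real synthesis + execution:
    each instantiated unit scales application runtime by a fixed factor."""
    factors = {"bs": 0.75, "mul": 0.70, "cache": 0.85, "div": 0.95, "fpu": 0.99}
    runtime = base_runtime
    for unit in config:
        runtime *= factors[unit]
    return runtime

def size_of(config):
    """HYPOTHETICAL equivalent-LUT sizes per unit."""
    sizes = {"bs": 200, "mul": 300, "cache": 700, "div": 400, "fpu": 900}
    return sum(sizes[u] for u in config)

def descend(ordered_units, size_limit):
    """Walk one root-to-leaf path of the impact-ordered tree: at each level,
    keep the unit only if it fits the size budget and improves measured
    runtime (one synthesis run per level)."""
    config = frozenset()
    best = synthesize_and_run(config)
    for unit in ordered_units:        # ordered by speedup/size impact
        trial = config | {unit}
        if size_of(trial) > size_limit:
            continue                  # doesn't fit: take the "No" branch
        t = synthesize_and_run(trial)
        if t < best:                  # helps: take the "Yes" branch
            config, best = trial, t
    return config, best

ordered = ["mul", "bs", "cache", "div", "fpu"]   # hypothetical impact order
config, runtime = descend(ordered, size_limit=1200)
print(sorted(config), runtime)
```

With the 5 characterization runs that produce the ordering, this matches the slide's 5 + 5 + base = 11 synthesis runs.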
15. Synthesis-in-the-Loop Approach
- View the solution space as a tree, each level a decision for a unit; order levels by unit speedup/size for the application
- 11 synthesis runs may take a few hours
- To reduce, can consider using a pre-determined order
- Determined by the soft core vendor based on averages over many benchmarks
[Figure: application-specific impact-ordered tree, with Yes/No branches (base, base+div, ...)]
16. Synthesis-in-the-Loop Approach
- Data for a fixed impact-ordered tree for 11 EEMBC benchmarks
17. Customization Results
- Fixed-tree approach generally best
- App-specific tree better for certain apps, but 2x tool runtime
- ICCAD'06, David Sheldon et al.
Results are averages for 11 EEMBC benchmarks
18. Conjoined Processors (ICCAD paper 2)
[Figure: two processors sharing a single multiplier]
- Conjoined processors
- Two processors sharing a hardware unit to save size (Kumar/Jouppi/Tullsen, ISCA 2004)
- Showed little performance overhead for desktop processors
- There, the only research customer is Intel; for soft core processors, the research customers are every soft core user
- How much size savings and performance overhead for conjoined Microblazes?
19. Conjoined Processors: Size Savings
20. Conjoined Processors: Performance Overhead
- We created a trace simulator
- Reads two instruction traces output by the MB simulator
- Adds a 1-cycle delay for every access to a conjoined unit (pessimistic assumption about the contention detection scheme)
- Looks for simultaneous accesses of the shared unit; stalls one MB entirely until the unit becomes available
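A minimal sketch of such a trace simulator, assuming a simplified trace format (a list of opcodes, one per cycle) and a fixed arbitration policy in which processor A always wins a simultaneous request; the real tool reads actual MB simulator traces.

```python
def simulate(trace_a, trace_b, shared_op="bs"):
    """Return total wall-clock cycles to drain both traces.

    Normal ops take 1 cycle. A shared-unit op takes 2 cycles (the extra
    cycle models the pessimistic contention-detection delay). If both
    processors request the unit in the same cycle, processor B stalls
    entirely until the unit becomes available.
    """
    traces = [list(trace_a), list(trace_b)]
    pc = [0, 0]          # next instruction index per processor
    stall = [0, 0]       # remaining stall cycles per processor
    cycles = 0
    while (pc[0] < len(traces[0]) or pc[1] < len(traces[1])
           or stall[0] or stall[1]):
        wants = [pc[p] < len(traces[p]) and stall[p] == 0
                 and traces[p][pc[p]] == shared_op for p in (0, 1)]
        if wants[0] and wants[1]:
            stall[1] = 2             # B loses arbitration; waits for the unit
        for p in (0, 1):
            if stall[p] > 0:
                stall[p] -= 1
            elif pc[p] < len(traces[p]):
                if traces[p][pc[p]] == shared_op:
                    stall[p] = 1     # extra cycle for contention detection
                pc[p] += 1
        cycles += 1
    return cycles

# Both processors hit the shared barrel shifter in the same cycle:
print(simulate(["bs", "add"], ["bs", "add"]))   # -> 5
```

Comparing against the no-conjoining baseline (every op takes 1 cycle) gives the per-pair performance overhead plotted on the following slides.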
21. Conjoined Processors: Performance Overhead
- Data shown for benchmarks that benefit (>1.3x speedup) from the barrel shifter
- Performance overheads are small
22. Performance Overhead for All Benchmark Pairs
23. Customization Considering Conjoinment
- Developed a 0-1 knapsack approach
- Disjunctively-constrained knapsack solution to accommodate conjoinment
ICCAD'06, David Sheldon et al.
Only 8 pairings shown due to space limits
Note: to avoid exaggerating the benefits of conjoinment, the data only considers benchmark pairs that significantly use a shared unit
24. Outline
- Two UCR ICCAD'06 papers
- Microblaze customization
- Microblaze conjoining (and customization)
- Current work targeting Microblaze users
- Design of Experiments paradigm
- System-level synthesis for multi-core systems
- Related FPGA work
- "Warp processing"
- Standard binaries for FPGAs
25. Ongoing Work: Design of Experiments Paradigm
- "Design of Experiments"
- Well-established discipline (>80 yrs) for tuning parameters
- For factories, crops, management, etc.
- Want to set parameter values for best output
- But each experiment is costly, so can't try all combinations
- Clear mapping of soft core customization to the DOE problem
- Given parameters and the number of possible experiments
- Generates which experiments to run (parameter values)
- Analyzes the resulting data
- Sound mathematical foundations
- Present focus of David Sheldon (4th-yr Ph.D.)
26. Ongoing Work: Design of Experiments Paradigm
- Suppose there is time for 12 experiments
- The DOE tool generates which 12 experiments to run
- The user fills in the results column
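One classic DOE construction that fits this setting is a two-level half-fraction, which covers k on/off parameters in 2^(k-1) runs instead of 2^k. The factor names below are hypothetical Microblaze parameters, and real DOE tools generate richer designs and analyze the results statistically; this only illustrates how a tool can pick which experiments to run.

```python
from itertools import product

factors = ["mul", "bs", "div", "cache"]   # each at two levels: -1 (off), +1 (on)

def half_fraction(k):
    """2^(k-1) fractional-factorial design: enumerate the full factorial
    over the first k-1 factors, then set the last factor to the product
    of the others (defining relation I = ABCD), halving the run count."""
    runs = []
    for levels in product((-1, 1), repeat=k - 1):
        last = 1
        for v in levels:
            last *= v
        runs.append(levels + (last,))
    return runs

design = half_fraction(len(factors))
print(len(design))        # 8 runs instead of 2**4 = 16
for run in design:
    print(dict(zip(factors, run)))
```

The user would then synthesize/run only those 8 configurations and fit main effects from the results.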
27. Ongoing Work: Design of Experiments Paradigm
- The DOE tool analyzes the results
- Finds the most important factors for a given application
28. Ongoing Work: Design of Experiments Paradigm
- Results for a different application
29. Ongoing Work: Design of Experiments Paradigm
- Interactions among parameters also automatically
determined
30. Ongoing Work: System Synthesis
- Given N applications
- Create a customized soft core for each app
- Criteria: meet the size constraint, minimize total applications' runtime
- Other criteria possible (e.g., meet a runtime constraint, minimize size)
- Present focus of Ryan Mannion (3rd-yr Ph.D.)
[Figure: N applications (App1, App2, ..., AppN), each mapped to its own customized core]
31. Ongoing Work: System Synthesis
Graduate student: Ryan Mannion, 3rd-yr Ph.D.
- Presently use an Integer Linear Program
- Solutions for a large set of Xilinx devices generated in seconds
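The selection problem can be sketched as follows: pick one pre-characterized core configuration per application so that total equivalent LUTs fit the device and total runtime is minimized. The real tool phrases this as an Integer Linear Program; this exhaustive version, with hypothetical runtimes and sizes, is equivalent only for tiny instances.

```python
from itertools import product

# HYPOTHETICAL per-app candidates: (config name, runtime, equivalent LUTs)
candidates = {
    "app1": [("base", 100, 1000), ("base+mul", 70, 1300)],
    "app2": [("base", 200, 1000), ("base+bs+mul", 120, 1500)],
}

def system_synthesis(candidates, size_limit):
    """Exhaustively try one configuration per app; keep the assignment
    that fits the device size limit with the smallest total runtime."""
    apps = sorted(candidates)
    best = None
    for choice in product(*(candidates[a] for a in apps)):
        size = sum(c[2] for c in choice)
        runtime = sum(c[1] for c in choice)
        if size <= size_limit and (best is None or runtime < best[0]):
            best = (runtime, dict(zip(apps, (c[0] for c in choice))))
    return best

print(system_synthesis(candidates, 2500))
```

An ILP encodes the same thing with one 0-1 variable per (app, configuration) pair, a one-config-per-app constraint, and the size bound; off-the-shelf solvers handle realistic instance sizes in seconds, as noted above.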
32. Outline
- Two UCR ICCAD'06 papers
- Microblaze customization
- Microblaze conjoining (and customization)
- Current work targeting Microblaze users
- Design of Experiments paradigm
- System-level synthesis for multi-core systems
- Related FPGA work
- "Warp processing"
- Standard binaries for FPGAs
33. Binary-Level Synthesis
- Binary-level FPGA compiler developed 2002-2006 (Greg Stitt, Ph.D. UCR 2007)
- A source-level FPGA compiler provides a limited solution
- A binary-level FPGA compiler provides a more general solution, at the expense of lost high-level information
[Figure: source-level flow (C/Java/asm → compilers/assembler → obj → linker → microprocessor binary and FPGA binary) vs. binary-level flow (microprocessor binary → binary-level FPGA compiler → FPGA binary)]
34. Binary Synthesis Competitive with Source Level
- Aggressive decompilation recovers most high-level constructs needed for good synthesis, making binary-level synthesis competitive with source level
Freescale H.264 decoder example, from ISSS/CODES 2005
35. Binary Synthesis Enables Dynamic Hardware/Software Partitioning
- Called "Warp Processing" (Vahid/Stitt/Lysecky, 2003-2007)
- Direct collaborators: Intel, IBM, and Freescale
36. Warp Processing Idea
1. Initially, the software binary is loaded into instruction memory
[Figure: µP with instruction memory, data cache, profiler, FPGA, and on-chip CAD]
37. Warp Processing Idea
2. Microprocessor executes instructions in the software binary
38. Warp Processing Idea
3. Profiler monitors instructions and detects critical regions in the binary
39. Warp Processing Idea
4. On-chip CAD reads in the critical region
40. Warp Processing Idea
5. On-chip CAD decompiles the critical region into a control/data flow graph (CDFG)
41. Warp Processing Idea
6. On-chip CAD synthesizes the decompiled CDFG to a custom (parallel) circuit
42. Warp Processing Idea
7. On-chip CAD maps the circuit onto the FPGA
43. Warp Processing Idea
8. On-chip CAD replaces instructions in the binary to use the hardware, causing performance and energy to "warp" by an order of magnitude or more
Mov reg3, 0
Mov reg4, 0
loop:  // instructions that interact with FPGA
Ret reg4
DAC'03, DAC'04, DATE'04, ISSS/CODES'04, FPGA'04, DATE'05, FCCM'05, ICCAD'05, ISSS/CODES'05, TECS'06, U.S. patent pending
44. Warp Processing Challenges
- Two key challenges
- Can we decompile binaries to recover enough high-level constructs to create fast circuits on FPGAs? (G. Stitt)
- Can we just-in-time (JIT) compile to FPGAs using limited on-chip compute resources? (R. Lysecky)
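The eight steps above can be connected in one control loop. Every component below is a hypothetical stand-in (the real system profiles in hardware and runs lean CAD algorithms on-chip); the sketch only shows how the pieces hand off to each other.

```python
def decompile(region):          # step 5 stand-in: instructions -> CDFG
    return ("cdfg", region)

def synthesize(cdfg):           # step 6 stand-in: CDFG -> parallel circuit
    return ("circuit",) + cdfg

def place_and_route(circuit):   # step 7 stand-in: circuit -> FPGA config
    return ("bitstream",) + circuit

def patch_binary(binary, region, bitstream):
    # Step 8 stand-in: replace the software region with a hardware-invoking stub
    binary[binary.index(region)] = ("call_fpga", region, bitstream)

def warp_process(binary, profile, hot_threshold=0.5):
    """binary: mutable list of region names; profile: region -> runtime share
    (steps 1-3: execution and hardware profiling are assumed done)."""
    for region in list(binary):
        if isinstance(region, str) and profile.get(region, 0.0) >= hot_threshold:
            cdfg = decompile(region)                 # step 5
            circuit = synthesize(cdfg)               # step 6
            bitstream = place_and_route(circuit)     # step 7
            patch_binary(binary, region, bitstream)  # step 8

binary = ["init", "kernel_loop", "cleanup"]
warp_process(binary, {"kernel_loop": 0.9})
print(binary[1][0])  # -> call_fpga
```

The two research challenges above map directly onto `decompile` (recovering enough structure) and the whole loop's footprint (running it with on-chip resources).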
45. Warp Processors: Performance Speedup (Most Frequent Kernel Only)
Baseline: SW-only execution
46. Warp Processors: Performance Speedup (Overall, Multiple Kernels)
Assuming a 100 MHz ARM, and fabric clocked at the rate determined by synthesis
- Energy reduction of 38-94%
Baseline: SW-only execution
47. Warp Processors: Speedups Compared with a Digital Signal Processor
48. Warp Processors: Speedups for Multi-Threaded Application Benchmarks
- Compelling computing advantage of FPGAs
- Parallelism from the bit level up to the processor level, and everywhere in between
49. FPGA Ubiquity via Obscurity
- Warp processing hides the FPGA from languages and tools
- ANY microprocessor platform extendible with FPGA
- Maintains the "ecosystem" of application, tool, and architecture developers
- New processor platforms with FPGAs appearing
[Figure: standard compiler and profiling flow feeding new processor platforms with FPGA]
50. FPGA Standard Binaries?
- A microprocessor binary represents one form of a "standard binary for FPGAs"
- Missing is explicit concurrency
- Parallelism, pipelining, queues, etc.
- As FPGAs appear in more platforms, might a more general FPGA binary evolve?
[Figure: applications and tools producing standard binaries and standard FPGA binaries, via profiling and a standard compiler, for many architectures]
51. FPGA Standard Binaries?
- A translator would make best use of existing FPGA resources
- Could even add FPGA, like adding memory, to improve performance
- Add more FPGA to your PDA to implement a compute-intensive application?
52. FPGA Standard Binaries
- NSF funding received for 2006-2009
- Xilinx letter of support was helpful
Graduate student: Scott Sirowy, 2nd-year Ph.D.
53. Future Work: Standard Binary
[Figure: high-level behavior, via a desktop tool and/or human effort, compiled to either a temporally-oriented binary (e.g., Mul r1, r2, r3; Mul r4, r5, r6; Add r7, r1, r4) OR a spatially-oriented binary (FPGA bitstream)]
57. Conclusions
- Soft core customization increasingly important to make best use of limited FPGA resources
- Good initial automatic customization results
- Design of Experiments paradigm looks promising
- System-level synthesis may yield a very useful MB user tool, perhaps web based
- Warp processing and standard FPGA binary work can help make FPGAs ubiquitous
- Accomplishments made possible by Xilinx donations and interactions
- Continued and close collaboration sought