Application-Specific Customization of Microblaze Processors, and other UCR FPGA Research

Provided by: romanl5
Learn more at: http://www.cs.ucr.edu

Transcript and Presenter's Notes


1
Application-Specific Customization of Microblaze
Processors, and other UCR FPGA Research
  • Frank Vahid
  • Professor
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Associate Director, Center for Embedded Computer
    Systems, UC Irvine
  • Work supported by the National Science
    Foundation, the Semiconductor Research
    Corporation, and Xilinx
  • Collaborators: David Sheldon (4th yr UCR PhD
    student), Roman Lysecky (PhD UCR 2005, now Asst.
    Prof. at U. Arizona), Rakesh Kumar (PhD UCSD
    2006, now Asst. Prof. at UIUC), Dean Tullsen
    (Prof. at UCSD)

2
Outline
  • Two UCR ICCAD'06 papers
  • Microblaze customization
  • Microblaze conjoining (and customization)
  • Current work targeting Microblaze users
  • Design of Experiments paradigm
  • System-level synthesis for multi-core systems
  • Related FPGA work
  • "Warp processing"
  • Standard binaries for FPGAs

3
Microblaze Customization (ICCAD paper 1)
  • FPGAs an increasingly popular software platform
  • FPGA soft core processor
  • Microprocessor synthesized onto FPGA fabric
  • Soft core customization
  • Cores come with configurable parameters
  • Xilinx Microblaze comes with several
    instantiable units: multiplier, barrel shifter,
    divider, FPU, or cache
  • Customization: tuning soft core parameters to a
    specific application

[Figure: Microblaze cores instantiated with different combinations of optional units (Mul, BS, Div, FPU, cache)]
4
Instantiable Unit Speedups
  • Instantiating units can yield significant
    speedups
  • base: Microblaze without any optional units
    instantiated

5
Customization Tradeoffs
Data for aifir EEMBC benchmark on Xilinx
Microblaze synthesized to Virtex device
6
Size on an FPGA
Image courtesy of Xilinx
  • Defining a circuit's size on an FPGA requires
    some work
  • Different resources
  • Lookup tables (LUTs)
  • Embedded multipliers
  • Embedded block RAM (BRAM)
  • Our solution: define equivalent LUTs for
    multipliers and BRAM
  • Based on total LUTs, multipliers, and BRAMs in a
    full Microblaze
  • Later found to closely match Xilinx's equivalent
    gates concept

7
Goal: Customize Soft Core to Minimize Application
Runtime
  • With and without size constraint
  • Even without size constraint, must take care
    because some units reduce clock frequency and
    thus may slow runtime

8
Goal: Customize Soft Core to Minimize
Application Runtime
  • With and without size constraint
  • Even without size constraint, must take care
    because some units reduce clock frequency and
    thus may slow runtime
  • "Full MB": MB with all units instantiated

9
Key Problem Related to Core Customization
  • Problem: synthesis of one soft core
    configuration, and execution/simulation on that
    configuration, requires about 15 minutes
  • Thus, for reasonable customization tool runtimes,
    can only synthesize 5-10 configurations in search
    of best one

10
Two Solution Approaches
  • Traditional CAD approach
  • Pre-characterize using synthesis and
    execution/simulation, create abstract problem
    model, solve using CAD exploration algorithms
  • Used 0-1 knapsack formulation
  • Synthesis-in-the-loop approach
  • Run synthesis and execute/simulate application
    while exploring
  • More accurate

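[Figure: two flows. Traditional CAD: start, pre-characterize via synthesis and execution/simulation (5-10 executions), build model (typically some form of graph), explore, finish. Synthesis-in-the-loop: start, explore with synthesis and execution/simulation in the loop (5-10 iterations), finish.]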
11
Traditional CAD Approach
  • Map to 0-1 knapsack problem
  • 0-1 knapsack problem
  • Given set of items, each with value and weight
  • Maximize value of items in a weight-constrained
    knapsack
  • Mapping
  • Item: instantiable unit
  • Value of an item: speedup increment when
    instantiating the unit, vs. base MB
  • Weight of an item: equivalent LUTs
  • Knapsack weight constraint: equivalent LUTs
    constraint

[Figure: items (Mul, BS, Div, FPU, cache) to be packed into a knapsack, with the microprocessor as the knapsack]
12
Traditional CAD Approach
  • Unit's size determined by synthesizing MB with
    only that unit instantiated
  • Requires 5 synthesis runs, plus 1 for base
  • About 1 hour
  • Unit's speedup increment determined by comparing
    runtime of application with and without the unit
    (different for every application)
  • Well-known knapsack heuristic uses value/weight
    ratio, applies dynamic programming. Complexity is
    O(nW)
  • n: number of items, W: knapsack constraint
  • Runtime is negligible (seconds)
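
The knapsack mapping on the previous two slides can be sketched in a few lines. This is only an illustration of the standard O(nW) dynamic program, not the tool described in the paper. The unit names, speedup increments, and sizes in EXAMPLE_UNITS are made-up values, and sizes are expressed in hundreds of equivalent LUTs so that W stays small.

```python
from typing import Dict, List, Tuple

# Hypothetical data: unit -> (speedup increment vs. base MB, size in
# hundreds of equivalent LUTs). Real values come from synthesis runs.
EXAMPLE_UNITS: Dict[str, Tuple[float, int]] = {
    "mul": (0.4, 3),
    "bs": (0.3, 2),
    "div": (0.1, 4),
    "fpu": (0.5, 9),
    "cache": (0.2, 5),
}


def select_units(units: Dict[str, Tuple[float, int]],
                 capacity: int) -> Tuple[float, List[str]]:
    """0-1 knapsack by dynamic programming, O(n*W) time.

    Returns (total value, chosen unit names) for the best subset whose
    total weight does not exceed capacity. Values are assumed additive,
    which is exactly the assumption the next slide questions.
    """
    # best[w] holds the best (value, chosen units) using weight <= w.
    best: List[Tuple[float, List[str]]] = [(0.0, []) for _ in range(capacity + 1)]
    for name, (value, weight) in units.items():
        # Walk weights downward so each unit is picked at most once.
        for w in range(capacity, weight - 1, -1):
            candidate = best[w - weight][0] + value
            if candidate > best[w][0]:
                best[w] = (candidate, best[w - weight][1] + [name])
    return best[capacity]


total, chosen = select_units(EXAMPLE_UNITS, capacity=10)
print(chosen, round(total, 2))
```

With these toy numbers the FPU is left out even though it has the largest single increment, because three smaller units fit in the same budget.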
13
Problem with Traditional CAD Approach
  • Does not consider interactions among units
  • Speedup increments may not be additive for given
    application
  • e.g., Mul → 0.4, BS → 0.3, but Mul+BS → 0.6,
    not 0.7
  • Thus, not a perfect mapping to the 0-1 knapsack
    problem
  • Because item values (speedup increments) don't
    add perfectly

Average pairwise speedup-increment additive
inaccuracies for all pairs of benchmarks
14
Synthesis-in-the-Loop Approach
  • View solution space as tree
  • Level: instantiate given unit?
  • If gives speedup and fits, instantiate
  • Use unit speedup/size to order tree
  • Impact-ordered tree approach
  • 5 synthesis runs, plus 1 for base, to determine
    each unit's speedup/size
  • Then requires 5 more synthesis runs to descend
    through tree

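[Figure: impact-ordered decision tree rooted at base. Levels, in order: barrel shifter, multiplier, cache, divider, FPU. Each level branches Yes/No, e.g. base → base+bs or base, then base+bs → base+bs+mul or base+bs, and base → base+mul or base.]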
15
Synthesis-in-the-Loop Approach
  • View solution space as tree, each level a
    decision for a unit, order levels by unit
    speedup/size for application
  • 11 synthesis runs may take a few hours
  • To reduce, can consider using pre-determined
    order
  • Determined by soft core vendor based on averages
    over many benchmarks
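
The tree descent can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' tool. The callback evaluate stands in for one synthesis plus execution/simulation run and returns (runtime, size). toy_evaluate is an invented analytic model used only to make the example runnable. The impact metric (speedup increment divided by added size) is one plausible reading of "unit speedup/size".

```python
from typing import Callable, FrozenSet, List, Tuple

Config = FrozenSet[str]


def toy_evaluate(config: Config) -> Tuple[float, int]:
    """Hypothetical stand-in for a 15-minute synthesis + simulation run."""
    gain = {"mul": 0.40, "bs": 0.30, "div": 0.05, "fpu": 0.0, "cache": 0.20}
    cost = {"mul": 300, "bs": 200, "div": 400, "fpu": 900, "cache": 500}
    speedup = 1.0 + sum(gain[u] for u in config)
    clock_penalty = 1.10 if "fpu" in config else 1.0  # unit slows the clock
    runtime = 100.0 / speedup * clock_penalty
    size = 1000 + sum(cost[u] for u in config)
    return runtime, size


def impact_ordered_search(units: List[str],
                          evaluate: Callable[[Config], Tuple[float, int]],
                          size_limit: int) -> Tuple[Config, float, int]:
    """Return (chosen config, its runtime, number of evaluate calls)."""
    runs = 0

    def run(config: Config) -> Tuple[float, int]:
        nonlocal runs
        runs += 1
        return evaluate(config)

    # Characterization: base plus one run per unit in isolation.
    base_rt, base_size = run(frozenset())
    impact = {}
    for u in units:
        rt, size = run(frozenset([u]))
        impact[u] = (base_rt / rt - 1.0) / max(size - base_size, 1)
    order = sorted(units, key=lambda u: impact[u], reverse=True)

    # Descent: one run per tree level; keep a unit if it helps and fits.
    current, current_rt = frozenset(), base_rt
    for u in order:
        trial = current | {u}
        rt, size = run(trial)
        if rt < current_rt and size <= size_limit:
            current, current_rt = trial, rt
    return current, current_rt, runs


config, runtime, runs = impact_ordered_search(
    ["mul", "bs", "div", "fpu", "cache"], toy_evaluate, size_limit=2000)
print(sorted(config), runs)
```

For five units the sketch performs 1 + 5 + 5 = 11 evaluations, matching the count on the slide. A fixed, vendor-supplied order would skip the characterization runs.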

[Figure: application-specific impact-ordered tree; the first level decides the divider (Yes: base+div, No: base)]
16
Synthesis-in-the-Loop Approach
  • Data for fixed impact-ordered tree for 11 EEMBC
    benchmarks


17
Customization Results
  • Fixed tree approach generally best
  • App-spec tree better for certain apps, but 2x
    runtime
  • ICCAD'06, David Sheldon et al.

Results are averages for 11 EEMBC benchmarks
18
Conjoined Processors (ICCAD paper 2)
[Figure: Processor 1 and Processor 2, each shown with a multiplier]
  • Conjoined processors
  • Two processors sharing a hardware unit to save
    size (Kumar/Jouppi/Tullsen ISCA 2004)
  • Showed little performance overhead for desktop
    processors
  • For desktop processors, only research customer
    is Intel; for soft core processors, research
    customers are every soft core user
  • How much size savings and performance overhead
    for conjoined Microblazes?

19
Conjoined Processors Size Savings
20
Conjoined Processors Performance Overhead
  • We created a trace simulator
  • Reads two instruction traces output by MB
    simulator
  • Adds 1-cycle delay for every access to a
    conjoined unit (pessimistic assumption about
    contention detection scheme)
  • Looks for simultaneous access of shared unit,
    stalls one MB entirely until unit becomes
    available
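
A trace simulator along these lines can be sketched as below. It is an illustration only. The trace format (one entry per instruction naming the unit it uses), the 1-cycle base cost per instruction, and the fixed priority of processor 0 on simultaneous access are all assumptions, since the slide does not specify them.

```python
from typing import List, Sequence, Tuple


def simulate_conjoined(trace_a: Sequence[str],
                       trace_b: Sequence[str],
                       shared: str = "mul") -> Tuple[int, int]:
    """Return the cycle at which each processor finishes its trace.

    Every instruction costs 1 cycle. An access to the shared unit costs
    1 extra cycle, and a processor that finds the unit busy stalls
    entirely until the unit becomes available.
    """
    traces: List[Sequence[str]] = [trace_a, trace_b]
    pc = [0, 0]            # next instruction index per processor
    ready_at = [0, 0]      # cycle at which each processor can issue again
    finish = [0, 0]
    unit_free_at = 0       # cycle at which the shared unit is free
    cycle = 0
    while pc[0] < len(trace_a) or pc[1] < len(trace_b):
        for p in (0, 1):   # processor 0 wins ties (assumed policy)
            if pc[p] >= len(traces[p]) or ready_at[p] > cycle:
                continue
            if traces[p][pc[p]] == shared:
                if unit_free_at > cycle:
                    continue              # stall this processor
                unit_free_at = cycle + 2  # 1 cycle of work + 1-cycle delay
                ready_at[p] = cycle + 2
            else:
                ready_at[p] = cycle + 1
            pc[p] += 1
            finish[p] = ready_at[p]
        cycle += 1
    return finish[0], finish[1]


print(simulate_conjoined(["mul", "alu"], ["mul", "alu"]))
```

Comparing the returned cycle counts against each trace's length gives the performance overhead of conjoining for that pair of benchmarks.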

21
Conjoined Processors Performance Overhead
  • Data shown for benchmarks that benefit (>1.3x
    speedup) from barrel shifter
  • Performance overheads are small

22
Performance overhead for all benchmark pairs
23
Customization Considering Conjoinment
  • Developed 0-1 knapsack approach
  • Disjunctively-Constrained Knapsack Solution to
    accommodate conjoinment
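
A disjunctively constrained knapsack adds pairs of items that may not be chosen together. The sketch below illustrates the idea with exhaustive search, which is fine for a handful of units. How the paper actually models conjoinment is not stated on the slide, so the private/shared item pairs and all numbers here are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Hypothetical items: name -> (speedup increment, equivalent LUTs).
# A shared unit is assumed cheaper but slightly slower than a private one.
EXAMPLE_ITEMS: Dict[str, Tuple[float, int]] = {
    "mul_private": (0.40, 300),
    "mul_shared": (0.38, 160),
    "bs_private": (0.30, 200),
    "bs_shared": (0.28, 110),
}
# Disjunctive constraints: at most one item from each pair.
EXAMPLE_CONFLICTS: List[Tuple[str, str]] = [
    ("mul_private", "mul_shared"),
    ("bs_private", "bs_shared"),
]


def best_subset(items: Dict[str, Tuple[float, int]],
                capacity: int,
                conflicts: List[Tuple[str, str]]) -> Tuple[float, Set[str]]:
    """Exhaustively find the best conflict-free subset within capacity."""
    names = list(items)
    best_value, best_choice = 0.0, set()
    for r in range(len(names) + 1):
        for choice in combinations(names, r):
            chosen = set(choice)
            if any(a in chosen and b in chosen for a, b in conflicts):
                continue  # violates a disjunctive constraint
            weight = sum(items[n][1] for n in choice)
            value = sum(items[n][0] for n in choice)
            if weight <= capacity and value > best_value:
                best_value, best_choice = value, chosen
    return best_value, best_choice


print(best_subset(EXAMPLE_ITEMS, 400, EXAMPLE_CONFLICTS))
```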

ICCAD'06 David Sheldon et al
Only 8 pairings shown due to space limits
Note: To avoid exaggerating the benefits of
conjoinment, data only considers benchmark pairs
that significantly use a shared unit
24
Outline
  • Two UCR ICCAD'06 papers
  • Microblaze customization
  • Microblaze conjoining (and customization)
  • Current work targeting Microblaze users
  • Design of Experiments paradigm
  • System-level synthesis for multi-core systems
  • Related FPGA work
  • "Warp processing"
  • Standard binaries for FPGAs

25
Ongoing Work: Design of Experiments Paradigm
  • "Design of Experiments"
  • Well-established discipline (>80 yrs) for tuning
    parameters
  • For factories, crops, management, etc.
  • Want to set parameter values for best output
  • But each experiment costly, so can't try all
    combinations
  • Clear mapping of soft core customization to DOE
    problem
  • Given parameters and number of possible experiments
  • Generates which experiments to run (parameter
    values)
  • Analyzes resulting data
  • Sound mathematical foundations
  • Present focus of David Sheldon (4th yr Ph.D.)

26
Ongoing Work: Design of Experiments Paradigm
  • Suppose time for 12 experiments
  • DOE tool generates which 12 experiments to run
  • User fills in results column

27
Ongoing Work: Design of Experiments Paradigm
  • DOE tool analyzes results
  • Finds most important factors for given application
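
The two DOE steps (generate the experiments, then analyze the results) can be sketched with a 12-run Plackett-Burman screening design. The slides do not say which design the DOE tool uses, so this choice is an assumption, and the result values in the usage line are synthetic. Each column is a factor such as an optional unit, with +1 meaning instantiated and -1 meaning left out.

```python
from typing import List

# Standard generator row of the 12-run Plackett-Burman design.
GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]


def plackett_burman_12(num_factors: int) -> List[List[int]]:
    """Return a 12-run two-level design for up to 11 factors."""
    if not 1 <= num_factors <= 11:
        raise ValueError("the 12-run design supports 1 to 11 factors")
    rows = []
    for shift in range(11):
        # Each row is a cyclic shift of the generator.
        row = [GENERATOR[(i - shift) % 11] for i in range(11)]
        rows.append(row[:num_factors])
    rows.append([-1] * num_factors)  # final run: every factor at low level
    return rows


def main_effects(design: List[List[int]], results: List[float]) -> List[float]:
    """Mean result at the high level minus mean result at the low level."""
    half = len(design) / 2
    effects = []
    for j in range(len(design[0])):
        high = sum(r for row, r in zip(design, results) if row[j] == +1)
        low = sum(r for row, r in zip(design, results) if row[j] == -1)
        effects.append(high / half - low / half)
    return effects


design = plackett_burman_12(5)
# Synthetic runtimes: factor 0 helps a lot, factor 2 helps a little.
results = [100 - 20 * row[0] - 5 * row[2] for row in design]
print(main_effects(design, results))
```

The factor with the largest effect magnitude is the most important one for that application. In practice the results column would hold measured runtimes from the 12 synthesis runs.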

28
Ongoing Work: Design of Experiments Paradigm
  • Results for a different application

29
Ongoing Work: Design of Experiments Paradigm
  • Interactions among parameters also automatically
    determined

30
Ongoing Work: System Synthesis
  • Given N applications
  • Create customized soft core for each app
  • Criteria: meet size constraint, minimize total
    applications' runtime
  • Other criteria possible (e.g., meet runtime
    constraint, minimize size)
  • Present focus of Ryan Mannion, 3rd yr Ph.D.

[Figure: applications App1, App2, through AppN, each mapped to its own customized soft core]
31
Ongoing Work: System Synthesis
Graduate student: Ryan Mannion, 3rd yr Ph.D.
  • Presently use Integer Linear Program
  • Solutions for large set of Xilinx devices
    generated in seconds
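
The system synthesis problem can be illustrated with a tiny example. The slides use an Integer Linear Program; the sketch below swaps in plain exhaustive enumeration, which solves the same selection problem for small inputs without needing a solver. The application names, configurations, sizes, and runtimes are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

# Hypothetical candidates per application: (config name, size, runtime).
EXAMPLE_OPTIONS: Dict[str, List[Tuple[str, int, float]]] = {
    "app1": [("base", 1000, 100.0), ("base+mul", 1300, 70.0)],
    "app2": [("base", 1000, 80.0), ("base+bs", 1200, 60.0),
             ("base+fpu", 1900, 30.0)],
}


def assign_cores(options: Dict[str, List[Tuple[str, int, float]]],
                 size_limit: int) -> Optional[Tuple[float, Dict[str, str]]]:
    """Pick one configuration per app: meet the total size constraint,
    minimize total runtime. Returns None if nothing fits."""
    best = None
    for combo in product(*options.values()):
        size = sum(c[1] for c in combo)
        runtime = sum(c[2] for c in combo)
        if size <= size_limit and (best is None or runtime < best[0]):
            best = (runtime, dict(zip(options, (c[0] for c in combo))))
    return best


print(assign_cores(EXAMPLE_OPTIONS, size_limit=2600))
```

Enumeration grows exponentially with the number of applications, which is one reason an ILP formulation is the more practical choice at scale.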

32
Outline
  • Two UCR ICCAD'06 papers
  • Microblaze customization
  • Microblaze conjoining (and customization)
  • Current work targeting Microblaze users
  • Design of Experiments paradigm
  • System-level synthesis for multi-core systems
  • Related FPGA work
  • Warp processing
  • Standard binaries for FPGAs

33
Binary-Level Synthesis
  • Binary-level FPGA compiler developed 2002-2006
    (Greg Stitt, Ph.D. UCR 2007)

[Figure: tool flow. Sources (C, Java, asm, obj) pass through compilers, assembler, and linker to produce a microprocessor binary. A source-level FPGA compiler produces an FPGA binary and a microprocessor binary, but provides a limited solution. A binary-level FPGA compiler produces an FPGA binary and a microprocessor binary from the linked binary, providing a more general solution at the expense of lost high-level information.]
34
Binary Synthesis Competitive with Source Level
  • Aggressive decompilation recovers most high-level
    constructs needed for good synthesis
  • Makes binary-level synthesis competitive with
    source level

Freescale H264 decoder example, from ISSS/CODES
2005
35
Binary Synthesis Enables Dynamic
Hardware/Software Partitioning
  • Called Warp Processing (Vahid/Stitt/Lysecky
    2003-2007)
  • Direct collaborators
  • Intel, IBM, and Freescale

36
Warp Processing Idea
Step 1: Initially, software binary loaded into
instruction memory
[Figure: warp processor with profiler, I Mem, µP, D, FPGA, and on-chip CAD]
37
Warp Processing Idea
Step 2: Microprocessor executes instructions in software
binary
38
Warp Processing Idea
Step 3: Profiler monitors instructions and detects
critical regions in binary
[Figure: profiler watching a repeating stream of beq and add instructions executed by the µP]
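
One simple way a profiler can detect critical regions is to count taken backward branches, since each one closes a loop iteration. The sketch below is a software illustration under that assumption. The slide does not describe the profiler's mechanism, and the trace format of (pc, taken branch target or None) pairs is invented for the example.

```python
from collections import Counter
from typing import List, Optional, Tuple

TraceEntry = Tuple[int, Optional[int]]  # (pc, taken branch target or None)


def hottest_loops(trace: List[TraceEntry], top: int = 1) -> List[Tuple[int, int]]:
    """Return the most frequent backward-branch targets as (address, count)."""
    counts: Counter = Counter()
    for pc, target in trace:
        # A taken branch to an earlier or equal address marks a loop back-edge.
        if target is not None and target <= pc:
            counts[target] += 1
    return counts.most_common(top)


# Toy trace: a two-instruction loop at address 4 that iterates 10 times.
toy_trace = [(0, None)] + [(4, None), (8, 4)] * 10 + [(12, None)]
print(hottest_loops(toy_trace))
```

The loop starting at the most frequent target is the candidate region handed to the on-chip CAD tools.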
39
Warp Processing Idea
Step 4: On-chip CAD reads in critical region
40
Warp Processing Idea
Step 5: On-chip CAD decompiles critical region into
control data flow graph (CDFG)
[Figure: on-chip CAD labeled as Dynamic Part. Module (DPM)]
41
Warp Processing Idea
Step 6: On-chip CAD synthesizes decompiled CDFG to a
custom (parallel) circuit
42
Warp Processing Idea
Step 7: On-chip CAD maps circuit onto FPGA


43
Warp Processing Idea
Step 8: On-chip CAD replaces instructions in binary to
use hardware, causing performance and energy to
warp by an order of magnitude or more
  Mov reg3, 0
  Mov reg4, 0
  loop: // instructions that interact with FPGA
  Ret reg4


DAC'03, DAC'04, DATE'04, ISSS/CODES'04, FPGA'04,
DATE'05, FCCM'05, ICCAD'05, ISSS/CODES'05,
TECS'06, U.S. Patent Pending
44
Warp Processing Challenges
  • Two key challenges
  • Can we decompile binaries to recover enough
    high-level constructs to create fast circuits on
    FPGAs? (G. Stitt)
  • Can we just-in-time (JIT) compile to FPGAs using
    limited on-chip compute resources? (R. Lysecky)

45
Warp Processors: Performance Speedup (Most
Frequent Kernel Only)
[Chart: speedup relative to SW-only execution]
46
Warp Processors: Performance Speedup (Overall,
Multiple Kernels)
Assuming 100 MHz ARM, and fabric clocked at rate
determined by synthesis
  • Energy reduction of 38% - 94%

[Chart: speedup relative to SW-only execution]
47
Warp Processors: Speedups Compared with Digital
Signal Processor
48
Warp Processors: Speedups for Multi-Threaded
Application Benchmarks
  • Compelling computing advantage of FPGAs
  • Parallelism from bit level up to processor
    level, and everywhere in between

49
FPGA Ubiquity via Obscurity
  • Warp processing hides FPGA from languages and
    tools
  • ANY microprocessor platform extendible with FPGA
  • Maintains "ecosystem": application, tool, and
    architecture developers
  • New platforms with FPGAs appearing

[Figure: profiling and standard compiler flow; new processor platforms with FPGA evolving]
50
FPGA Standard Binaries?
  • Microprocessor binary represents one form of a
    "standard binary for FPGAs"
  • Missing is explicit concurrency
  • Parallelism, pipelining, queues, etc.
  • As FPGAs appear in more platforms, might a more
    general FPGA binary evolve?

[Figure: ecosystem of applications, tools (profiling, standard compiler), and architectures built around standard binaries and standard FPGA binaries]
51
FPGA Standard Binaries?
  • Translator would make best use of existing FPGA
    resources
  • Could even add FPGA, like adding memory, to
    improve performance
  • Add more FPGA to your PDA to implement
    compute-intensive application?

52
FPGA Standard Binaries
  • NSF funding received for 2006-2009
  • Xilinx letter of support was helpful

Graduate student: Scott Sirowy, 2nd year Ph.D.
53
Future Work: Standard Binary
High-level behavior
Desktop tool and/or human effort
OR
Temporally-oriented binary
Spatially-oriented binary


Mul r1, r2, r3
Mul r4, r5, r6
Add r7, r1, r4

1001 001 010 011
1001 100 101 110
1000 111 001 100
000 111 001 010 011 001 111 100 101 110 010 110 011 110 111
Binaries
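
The first three bit groups on this slide line up with the three assembly instructions if each instruction is read as a 4-bit opcode followed by three 3-bit register fields. The sketch below encodes instructions under that reading. The opcode values (Mul = 1001, Add = 1000) and field widths are inferred from the slide's bit string, not taken from any real instruction set, and the final 45-bit string on the slide is not modeled here.

```python
# Opcode values inferred from the slide's example, not from a real ISA.
OPCODES = {"Add": "1000", "Mul": "1001"}


def encode(instruction: str) -> str:
    """Encode 'Op rX, rY, rZ' as a 4-bit opcode plus three 3-bit fields."""
    op, operands = instruction.split(maxsplit=1)
    registers = [r.strip() for r in operands.split(",")]
    fields = [format(int(r[1:]), "03b") for r in registers]
    return " ".join([OPCODES[op]] + fields)


for line in ["Mul r1, r2, r3", "Mul r4, r5, r6", "Add r7, r1, r4"]:
    print(encode(line))
```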
54
Future Work: Standard Binary
55
Future Work: Standard Binary
56
Future Work: Standard Binary
57
Conclusions
  • Soft core customization increasingly important to
    make best use of limited FPGA resources
  • Good initial automatic customization results
  • Design of Experiments paradigm looks promising
  • System-level synthesis may yield very useful MB
    user tool, perhaps web based
  • Warp processing and standard FPGA binary work can
    help make FPGAs ubiquitous
  • Accomplishments made possible by Xilinx donations
    and interactions
  • Continued and close collaboration sought