A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning


1
A Configurable Logic Architecture for Dynamic
Hardware/Software Partitioning
  • Roman Lysecky, Frank Vahid
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems at UC Irvine
  • This work was supported in part by the National
    Science Foundation, the Semiconductor Research
    Corporation, and a Department of Education GAANN
    fellowship

2
Introduction: Dynamic Software Optimization
  • Dynamic optimizations are increasingly common
      • Dynamo - Dynamic software optimizations
      • Transmeta Crusoe, Efficeon - Dynamic code
        morphing
      • Just In Time (JIT) Compilation - Interpreted
        languages
  • Advantages
      • Transparent optimizations
      • No designer effort
      • No tool restrictions
      • Adapts to actual usage
  • Drawbacks
      • Currently limited to software optimizations
      • Limited speedup (1.1x to 1.3x common)

3
Introduction: Hardware/Software Partitioning
  • Benefits
      • Speedups of 2x to 10x typical
      • Speedups of 800x possible
      • Far more potential than dynamic SW optimizations
        (1.2x)
      • Energy reductions of 25% to 95% typical

4
Introduction: Traditional Hardware/Software Partitioning
  • Requires specialized CAD tools
  • Non-standard partitioning compilers

5
Introduction: Binary Hardware/Software Partitioning
  • Binary partitioning [Stitt/Vahid, ICCAD'02]
      • Partition application starting from SW binary
      • Can be desktop based
  • Advantages
      • Use any standard compiler
      • Supports any language
      • Supports multiple sources from multiple languages
      • Supports assembly/object code
      • Supports legacy code
  • Disadvantage
      • Loses some high-level information, so there may
        be some loss of quality

6
Introduction: Dynamic Hardware/Software Partitioning
  • Dynamic HW/SW partitioning
      • Embed HW/SW partitioning CAD tools on-chip
      • Feasible in era of billion-transistor chips
  • Advantages
      • Does not require any special compilers
      • Completely transparent
      • Brings benefits of HW/SW partitioning to all SW
        designers
  • Complements other approaches
      • Desktop CAD best from purely technical
        perspective
      • Dynamic opens additional market segments (i.e.,
        all software developers) that otherwise might not
        use desktop CAD

7
Introduction: Warp Processors
  1. Initially execute application in software only
  2. Profile application to determine critical regions
  3. Partition critical regions to hardware
  4. Program configurable logic and update software
     binary
  5. Partitioned application executes faster with
     lower energy consumption
[Figure: warp processor block diagram. Labels: MIPS/ARM, I, D, Profiler, Configurable Logic, Dynamic Part. Module (DPM)]
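
The profiling step (step 2) can be illustrated with a small sketch. The slides do not say how the profiler identifies critical regions, so the approach here is an assumption: count taken backward branches and treat the most frequent one as the back-edge of the critical loop. The function name `find_critical_loop` and the trace format are hypothetical.

```python
from collections import Counter

def find_critical_loop(branch_trace):
    """Return (branch_pc, target_pc, count) for the hottest backward branch.

    branch_trace: iterable of (branch_pc, target_pc) pairs for taken branches.
    Assumption: a taken branch whose target is at or before its own address
    is a loop back-edge, and the hottest back-edge marks the critical loop.
    """
    counts = Counter(
        (pc, target) for pc, target in branch_trace if target <= pc
    )
    if not counts:
        return None  # no loops observed in the trace
    (pc, target), count = counts.most_common(1)[0]
    return pc, target, count

# Hypothetical trace: one hot loop, one cold loop, a few forward branches.
trace = [(0x120, 0x100)] * 900 + [(0x200, 0x1F0)] * 50 + [(0x80, 0x300)] * 10
hot = find_critical_loop(trace)
print(hot)
```

In this made-up trace the loop spanning 0x100 to 0x120 would be handed to the partitioning step.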
8
Warp Processors: Requirements and Tools
  • Warp processor architecture and tools
      • Basic configurable logic architecture
      • Efficient profiling architecture
      • On-chip CAD tools for HW/SW partitioning
          • Decompilation
          • Synthesis
          • Technology mapping
          • Placement and routing

[Figure: warp processor block diagram. Labels: Profiler, ARM, I, D, Config. Logic, DPM]
9
Warp Configurable Logic Architecture: Requirements
  • Robustness
  • Capable of supporting large set of applications
  • Simplicity
  • Existing FPGAs are too complex for warp
    processors
  • Design goals of FPGAs much different
  • Design configurable fabric by analyzing
    architectural features as to their impacts on
    on-chip CAD tools
  • Fast execution
  • Very low data memory
  • Produce reasonable hardware circuits
  • Efficient interface to memory

10
Warp Configurable Logic Architecture
  • Data address generators (DADG) and loop control
    hardware (LCH)
      • Found in most digital signal processors
      • Provide fast loop execution
      • Support memory accesses with regular access
        patterns
      • Synthesis of FSM not required for many critical
        loops
      • A configurable logic fabric input provides
        alternative control of loop execution
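
A regular access pattern of the kind a data address generator handles can be sketched in a few lines. This is only an illustration under assumptions: the slides give no register layout, so the parameters `base`, `stride`, and `count` are hypothetical, and the Python loop counter merely stands in for the loop control hardware.

```python
def address_stream(base, stride, count):
    """Yield the addresses base, base + stride, base + 2*stride, ...

    Hypothetical model of a data address generator: each loop iteration
    produces the next address without any instruction computing it.
    """
    for i in range(count):  # stands in for the loop control hardware
        yield base + i * stride

# Five consecutive 32-bit words starting at a made-up base address.
addrs = list(address_stream(base=0x1000, stride=4, count=5))
print([hex(a) for a in addrs])
```

Because the address sequence and the iteration count are fixed up front, no finite state machine has to be synthesized for a loop of this shape.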

[Figure: WCLA block diagram. Labels: DADG, LCH, 32-bit MAC, Configurable Logic Fabric]
11
Warp Configurable Logic Architecture
  • Integrated 32-bit multiplier-accumulator (MAC)
  • Multiplications are frequently found within
    critical loops
  • Frequently in the form of a multiply-accumulate
    operation
  • Fast, single-cycle multipliers are large and
    require many interconnections
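
The multiply-accumulate pattern that the MAC targets looks like the following sketch. The wrap-around to 32 unsigned bits is an assumption made for the example; the slides state only that the unit is a 32-bit MAC and do not describe its overflow behaviour.

```python
MASK32 = 0xFFFFFFFF  # assumption: results wrap to 32 unsigned bits

def mac32(acc, a, b):
    """One multiply-accumulate step: acc + a * b, wrapped to 32 bits."""
    return (acc + a * b) & MASK32

def dot_product(xs, ys):
    """A typical critical loop: one MAC operation per iteration."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac32(acc, x, y)
    return acc

print(dot_product([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```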

12
Warp Configurable Logic Architecture: Configurable Logic Fabric
  • Array of configurable logic blocks (CLBs)
    surrounded by switch matrices (SMs)
  • Each CLB is directly connected to a SM
  • Switch matrix connections
  • Four short wires connect adjacent SMs
  • Four long wires connect every other SM together
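
As a rough illustration of the two wire lengths, the sketch below counts the wire segments needed to span a given distance along one row of switch matrices. It assumes a short wire covers one SM-to-SM step and a long wire covers two ("every other SM"), and that a route may mix both; the function name is hypothetical and real routes are further constrained by the channel rules of the switch matrix.

```python
def segments_needed(distance):
    """Return (long_segments, short_segments) to span `distance` SM steps.

    Assumption: a long wire spans 2 steps and a short wire spans 1, so the
    fewest segments use as many long wires as possible.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return distance // 2, distance % 2

for d in (1, 4, 5):
    print(d, segments_needed(d))
```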

[Figure: configurable logic fabric showing CLBs surrounded by SMs]
13
Warp Configurable Logic Architecture: Combinational Logic Block Design
  • Several studies have analyzed the impact of LUT
    and CLB size on overall design area and delay
      • LUTs with 5 to 6 inputs result in best
        performance
      • LUTs with fewer than 3 inputs have much worse
        performance [Chow et al. 1999; Singh et al.
        1992]
      • CLB cluster sizes of 3 to 20 LUTs are feasible
        [Marquardt, Betz, Rose 2000]

14
Warp Configurable Logic Architecture: Combinational Logic Block Design
  • Incorporate two 3-input 2-output LUTs
  • Corresponds to four 3-input LUTs
  • Allows for good quality circuit while reducing
    on-chip CAD tools complexity
  • Provide routing resources between adjacent CLBs
    to support carry chains
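
One way to picture a 3-input 2-output LUT is as an 8-entry table whose entries hold two output bits. The sketch below is an illustration, not the mapping used by the on-chip tools: it fills such a table with a full adder (sum and carry-out) and chains copies of it, passing the carry from one bit to the next in the way a carry chain between adjacent CLBs would. All names are hypothetical.

```python
def make_lut3x2(func):
    """Build a 3-input, 2-output LUT as an 8-entry table of (out0, out1)."""
    table = []
    for idx in range(8):
        a, b, c = (idx >> 2) & 1, (idx >> 1) & 1, idx & 1
        table.append(func(a, b, c))
    return table

def lut_eval(table, a, b, c):
    """Look up the two outputs for input bits a, b, c."""
    return table[(a << 2) | (b << 1) | c]

def full_adder(a, b, cin):
    total = a + b + cin
    return total & 1, total >> 1  # (sum, carry_out)

FA_LUT = make_lut3x2(full_adder)

def ripple_add(x, y, width):
    """Add two width-bit numbers, one LUT evaluation per bit position."""
    carry, result = 0, 0
    for bit in range(width):
        s, carry = lut_eval(FA_LUT, (x >> bit) & 1, (y >> bit) & 1, carry)
        result |= s << bit
    return result, carry

print(ripple_add(5, 6, 4))
```

A single 3-input LUT with one output would need two table lookups per bit for the same adder, which is one reading of why the slide equates two 2-output LUTs with four 3-input LUTs.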

FPGAs vs. WCLA
  • FPGAs (flexibility): large CLBs, various internal
    routing resources
  • WCLA (simplicity): limited internal routing,
    reduced technology mapping complexity
15
Warp Configurable Logic Architecture: Switch Matrix
  • SM connected using eight channels per side
      • Four short channels
      • Four long channels
  • Routes connect wires from different sides using
    the same channel
  • Each short channel is associated with a single
    long channel
  • Wires are routed using a single pair of channels
    through the configurable logic fabric
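
The single-pair rule keeps routing simple: a net only needs one channel index that is free along its whole path. The sketch below shows that idea with a greedy first-fit choice. It is not the routing algorithm from the work; the data structures, the function `route_net`, and the segment names are hypothetical.

```python
NUM_PAIRS = 4  # four short channels, each paired with one long channel

def route_net(path_segments, used):
    """Assign one channel pair to every segment of a net's path.

    path_segments: list of hashable ids for the SM-to-SM segments crossed.
    used: dict mapping segment id -> set of pair indices already taken.
    Returns the chosen pair index, or None if no pair is free end to end.
    """
    for pair in range(NUM_PAIRS):
        if all(pair not in used.get(seg, set()) for seg in path_segments):
            for seg in path_segments:
                used.setdefault(seg, set()).add(pair)
            return pair
    return None

used = {}
first = route_net(["a", "b", "c"], used)   # takes the first free pair
second = route_net(["b", "c", "d"], used)  # shares b and c, needs another pair
print(first, second)
```

Since a net never changes channel at a switch matrix, the router searches over four candidates per net rather than over combinations of channels per segment.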

FPGAs vs. WCLA
  • FPGAs (flexibility): large routing resources,
    require complex routing algorithms
  • WCLA (simplicity): allows for design of a less
    complex routing algorithm
16
Results: Benchmarks
  • Considered 12 embedded benchmarks from NetBench,
    MediaBench, EEMBC, and Powerstone
  • An average of 53% of total software execution
    time was spent executing a single critical loop
    (more speedup possible if more loops considered)
  • On average, critical loops comprised only 1% of
    total program size
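
The time share of the critical loop caps the speedup that moving that one loop to hardware can deliver. The sketch below applies Amdahl's law, which the slides do not cite, taking the critical-loop share as 53% of execution time. The 10x loop speedup is a made-up input, and applying the formula to an average over benchmarks is only indicative.

```python
def overall_speedup(fraction, kernel_speedup):
    """Amdahl's law: overall speedup when `fraction` of the run time
    is accelerated by a factor of `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def speedup_bound(fraction):
    """Upper bound as the accelerated part's run time approaches zero."""
    return 1.0 / (1.0 - fraction)

loop_share = 0.53  # critical-loop share of execution time
print(round(overall_speedup(loop_share, 10.0), 2))  # hypothetical 10x loop
print(round(speedup_bound(loop_share), 2))
```

With a single loop in hardware, the bound is a little over 2x, which is why partitioning additional loops is the route to larger speedups.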

17
Results: Experimental Setup
  • Warp processor
      • 75 MHz ARM7 processor
      • Configurable logic fabric with fixed frequency
        of 60 MHz
      • Used dynamic partitioning CAD tools to map
        critical region to hardware
          • Executed on an ARM7 processor
          • Active for roughly 10 seconds to perform
            partitioning
  • Traditional HW/SW partitioning
      • 75 MHz ARM7 processor
      • Xilinx Virtex-E FPGA (executing at maximum
        possible speed)
      • Manually partitioned software using VHDL
      • VHDL synthesized using Xilinx ISE 4.1 on desktop

18
Results: Performance Speedup
19
Results: Energy Reduction
20
Context: UCR's Research on Configurable SoCs
Self-Tuning, Self-Configuring, Mass-Produced ICs
21
Conclusions and Future Work
  • Warp configurable logic fabric
      • Supports a wide range of embedded systems
        applications
      • Designed specifically to allow development of
        lean on-chip CAD tools
  • Provides excellent results
      • Average speedup of 2.1x
      • Average energy reduction of 33%
      • Much better than dynamic software optimizations
      • One loop only; more speedup possible
      • More recent examples since DATE publication:
        10x speedups
      • Working towards examples with 100x speedups
  • Future work
      • Partitioning multiple software loops to hardware
      • Synthesizing finite state machines (FSMs)
      • Improved synthesis, technology mapping, and
        place and route