Compiler-directed Synthesis of Multifunction Loop Accelerators

Transcript and Presenter's Notes

Title: Compiler-directed Synthesis of Multifunction Loop Accelerators


1
Compiler-directed Synthesis of Multifunction Loop
Accelerators
  • Kevin Fan, Manjunath Kudlur,
  • Hyunchul Park, Scott Mahlke
  • Advanced Computer Architecture Laboratory
  • University of Michigan

2
Accelerating Streaming Applications
  • Streaming applications
  • Discrete transformations operating on data stream
  • High performance
  • Map application to pipeline of accelerators
  • Multifunction accelerators reuse hardware
  • Improve hardware efficiency

[Figure: an application (Loop 1, a "Frame Type?" decision, Loops 2-4, Block 5) mapped onto an accelerator pipeline of one loop accelerator (LA1) and two multifunction loop accelerators (LA2, LA3), with DRAM.]

3
Loop Accelerator Schema
  • Hardwired state machine for one or more critical
    loops
  • Order-of-magnitude power and performance
    improvements over more general designs

4
Single Function Accelerator Design
  • Use compiler as architecture synthesis tool
  • Parameterized meta-architecture: all loop
    accelerators have the same general organization
  • Performance/throughput is an input
  • Compiler analysis to understand computation and
    communication requirements
  • Hardware-sensitive optimization to reduce cost

5
Flow Diagram
[Flow diagram: Application Loop, Desired II -> Allocate FUs -> Abstract Arch (FUs) -> Modulo Schedule -> Scheduled Ops -> Build Datapath -> Concrete Arch (FUs, RF) -> Instantiate Arch -> Verilog, Control Signals -> Synthesize -> Loop Accelerator]

6
FU Allocation
  • Given operations in a loop and cost of hardware
    cells implementing those operations
  • Minimize total FU cost while supporting all
    operations

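Example: II = 2, with loop operations 3 ADD, 1 SUB, 2 LOAD
[Figure: the allocated FUs, including a subtract unit (-) and a memory unit (MEM)]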
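
The allocation problem on this slide can be sketched as a small exhaustive search. This is only an illustration, not the system's actual allocator: the FU library, its costs, and the combined ADDSUB unit are hypothetical, and the names `feasible` and `allocate_fus` are made up for this example. It assumes each FU instance offers II issue slots per loop iteration.

```python
from itertools import combinations, product
from math import ceil

# Hypothetical FU library: name -> (cost, opcodes it can execute).
FU_LIBRARY = {
    "ADD":    (10, {"ADD"}),
    "SUB":    (10, {"SUB"}),
    "ADDSUB": (13, {"ADD", "SUB"}),
    "MEM":    (20, {"LOAD"}),
}

def feasible(op_counts, fu_counts, ii):
    """True if every op can get its own (FU, cycle mod II) slot.

    Each FU instance offers `ii` issue slots. By Hall's condition it is
    enough to compare every group of opcode types against the slots of
    the FUs able to run at least one opcode in the group.
    """
    opcodes = list(op_counts)
    for r in range(1, len(opcodes) + 1):
        for group in combinations(opcodes, r):
            demand = sum(op_counts[op] for op in group)
            slots = ii * sum(n for fu, n in fu_counts.items()
                             if FU_LIBRARY[fu][1] & set(group))
            if demand > slots:
                return False
    return True

def allocate_fus(op_counts, ii):
    """Return (cost, {fu_name: count}) of the cheapest feasible FU mix."""
    names = list(FU_LIBRARY)
    # No FU type is ever needed more often than this.
    limit = ceil(sum(op_counts.values()) / ii)
    best = None
    for counts in product(range(limit + 1), repeat=len(names)):
        fu_counts = dict(zip(names, counts))
        if not feasible(op_counts, fu_counts, ii):
            continue
        cost = sum(FU_LIBRARY[fu][0] * n for fu, n in fu_counts.items())
        if best is None or cost < best[0]:
            best = (cost, fu_counts)
    return best

# The slide's example: II = 2 with 3 ADD, 1 SUB, 2 LOAD.
cost, fus = allocate_fus({"ADD": 3, "SUB": 1, "LOAD": 2}, ii=2)
```

With these made-up costs the search picks one ADD unit, one ADDSUB unit, and one MEM unit (total cost 43): four add/subtract operations need two FUs at II = 2, and only one of them has to pay for subtract capability.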
7
Modulo Scheduling and Datapath Derivation
  • Schedule to abstract architecture (FUs)
  • Determine register and interconnect requirements
    from schedule

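[Figure: source code example. Only the fragment "r1 Memr2 r3 r1 12" survives extraction; it appears to be a load through r2 into r1 followed by an operation on r1 and the constant 12.]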
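
One way to read "determine register requirements from schedule" is the standard MaxLive calculation for modulo schedules, sketched below. This is a generic illustration and not necessarily how the presented system sizes its register files; the function name `max_live` and the example lifetimes are hypothetical.

```python
def max_live(lifetimes, ii):
    """Peak number of simultaneously live values in a modulo schedule.

    lifetimes: (start, end) cycle pairs for each value produced in one
    iteration, with `end` exclusive. Because a new iteration starts every
    `ii` cycles, a value live at cycle t occupies slot t % ii in steady
    state, and a lifetime longer than `ii` overlaps with its own copy
    from the next iteration.
    """
    live = [0] * ii
    for start, end in lifetimes:
        for t in range(start, end):
            live[t % ii] += 1
    return max(live)

# Two values at II = 2: one live over cycles 1-2, one over cycles 2-4.
registers_needed = max_live([(1, 3), (2, 5)], ii=2)
```

In this example the second value lives for three cycles, longer than II, so two copies of it are live at once in steady state and three registers are needed in total.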
8
Multifunction Accelerator
  • Single hardware accelerator to run multiple loops
  • Could place single function accelerators side by
    side
  • Want to exploit potential hardware sharing
    between loops
  • Function units
  • Registers
  • Interconnect

9
Multifunction Design Strategies
  • 1. Union Method
  • 2. Phase Ordered Method

[Figure: each method illustrated with a row of FU blocks]

10
Union Method
Goal: combine FUs and register files to improve
hardware sharing.
Positional Union
[Figure: positional union of the FUs of Accel 1 and Accel 2, including subtract (-) and memory (M) units]

11
Union Method
  • Smart union formulated as ILP problem which
    minimizes FU and register cost
  • Benefit: looks at the whole design at once
  • Limitation: schedules are fixed prior to the union
    phase
  • Fast runtime
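
The difference between positional union and a cost-minimizing union can be sketched as follows. The slides formulate the smart union as an ILP over FUs and registers; the sketch below swaps in a brute-force search over FU pairings only, and the cost model (a merged FU costs the sum of the opcodes it supports), the opcode costs, and the two example accelerators are all hypothetical.

```python
from itertools import permutations

# Hypothetical per-opcode hardware costs.
OP_COST = {"ADD": 10, "SUB": 10, "MEM": 20}

def fu_cost(ops):
    """Cost of one FU that supports the given set of opcodes."""
    return sum(OP_COST[op] for op in ops)

def pad(accel, n):
    """Extend an FU list with empty FUs so both lists have n entries."""
    return list(accel) + [frozenset()] * (n - len(accel))

def union_cost(accel1, accel2):
    """Cost when FU i of accel1 is merged with FU i of accel2."""
    return sum(fu_cost(a | b) for a, b in zip(accel1, accel2))

def positional_union(accel1, accel2):
    """Merge FUs in the order in which they happen to be listed."""
    n = max(len(accel1), len(accel2))
    return union_cost(pad(accel1, n), pad(accel2, n))

def best_union(accel1, accel2):
    """Try every pairing of FUs and keep the cheapest (stand-in for the ILP)."""
    n = max(len(accel1), len(accel2))
    fixed = pad(accel1, n)
    return min(union_cost(fixed, order) for order in permutations(pad(accel2, n)))

accel1 = [frozenset({"ADD"}), frozenset({"SUB"}), frozenset({"MEM"}), frozenset({"MEM"})]
accel2 = [frozenset({"MEM"}), frozenset({"ADD"}), frozenset({"SUB"})]

side_by_side = sum(map(fu_cost, accel1)) + sum(map(fu_cost, accel2))
positional = positional_union(accel1, accel2)
best = best_union(accel1, accel2)
```

For this made-up pair the side-by-side cost is 100, the positional union saves nothing (100) because no like units line up, and the best pairing costs 60. Exhaustive search only works for a handful of FUs, which is one reason a real system would use an ILP or a matching formulation instead.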

12
Cost of Union of Accelerators
[Chart: cost of union for the Image Processing, MPEG4, and Signal Processing applications]
  • Worst union: 25% average savings
  • Positional union: 29% average savings
  • Best union: 33% average savings

13
Phase Ordered Method
  • Schedule loops in order
  • During scheduling, account for hardware from
    previous loop
  • Cost sensitive scheduler attempts to minimize
    hardware cost increase


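[Figure: Loop 1 is scheduled onto a set of FUs to form Accel 1; Loop 2 is then scheduled against that hardware to form the combined accelerator (label garbled as "Accel 12")]
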
14
Cost Sensitive Scheduling
  • Different valid scheduling alternatives are not
    equal in hardware cost

[Figure: two alternative schedules of loads LD1 and LD2 across FU1, FU2, and FU3 over time]

15
Greedy Cost Sensitive Scheduler
  • Select scheduling alternative with minimum cost
  • Account for estimated cost of unscheduled ops

[Figure: the modulo scheduler passes each scheduling alternative (Alt i) for the loop's operations to the hardware cost modeler, which returns its cost (Cost i)]
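
The greedy selection described above can be sketched as a loop that scores every free (FU, cycle) alternative by the hardware it would add. This is a simplified illustration, not the presented scheduler: it treats operations as independent with fixed earliest cycles, it omits the estimated cost of unscheduled operations, and the opcode costs and the name `greedy_schedule` are hypothetical.

```python
# Hypothetical cost of adding support for an opcode to an FU.
OP_COST = {"ADD": 10, "SUB": 10, "LD": 20}

def greedy_schedule(ops, fus, ii):
    """Place each op on the alternative with the lowest added hardware cost.

    ops: list of (name, opcode, earliest_cycle) in priority order.
    fus: list of opcode sets inherited from previously scheduled loops;
         it is extended in place as capabilities or new FUs are added.
    Returns (placements, added_cost), placements mapping name -> (fu, cycle).
    """
    occupied = set()            # modulo reservation table: (fu, cycle % ii)
    placements, added_cost = {}, 0
    for name, opcode, earliest in ops:
        best = None
        for f, caps in enumerate(fus):
            # Reusing an FU that already supports the opcode is free.
            inc = 0 if opcode in caps else OP_COST[opcode]
            for t in range(earliest, earliest + ii):
                if (f, t % ii) not in occupied:
                    if best is None or (inc, t, f) < best:
                        best = (inc, t, f)
                    break       # later cycles on this FU cannot cost less
        if best is None:        # every slot is taken: add a new FU
            fus.append(set())
            best = (OP_COST[opcode], earliest, len(fus) - 1)
        inc, t, f = best
        fus[f].add(opcode)
        occupied.add((f, t % ii))
        placements[name] = (f, t)
        added_cost += inc
    return placements, added_cost

# Hardware left by a first loop: one load unit and two adders.
hardware = [{"LD"}, {"ADD"}, {"ADD"}]
loop2 = [("LD1", "LD", 0), ("LD2", "LD", 0), ("S1", "SUB", 1)]
placements, added_cost = greedy_schedule(loop2, hardware, ii=2)
```

Here both loads end up on the existing load unit in consecutive cycles rather than forcing load capability onto an adder, so the only new hardware is subtract support on one adder (added cost 10).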
16
Phase Ordered Method
  • Extend conventional iterative modulo scheduler
    with hardware cost model
  • Benefits:
  • Scheduler is aware of hardware for all previously
    scheduled loops
  • Can adjust schedule to improve cost savings
  • Limitation: process is localized and greedy;
    schedules of previous loops are fixed
  • Fast runtime

17
Cost Sensitive Scheduling Comparison
[Chart: cost-sensitive scheduling comparison for the Image Processing, MPEG4, and Signal Processing applications]
  • Greedy scheduling: 41% average savings
  • ILP scheduling: 51% average savings

18
Union vs. Phase Ordered Methods
[Chart: union vs. phase ordered methods for the Image Processing, MPEG4, and Signal Processing applications]
  • Union method: 45% average savings
  • Phase ordered method: 41% average savings

19
Conclusion
  • Compiler-directed design system
  • Multifunction accelerator for hardware reuse
  • Two multifunction design methods
  • Smart union of single-function accelerators: 45%
    average cost savings
  • Phase ordered scheduling: 41% average cost
    savings
  • Overall, 20-61% hardware savings from sharing

20
Questions?