Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

1
Modulo Scheduling for Highly Customized Datapaths
to Increase Hardware Reusability
  • Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott
    Mahlke
  • Advanced Computer Architecture Laboratory
  • University of Michigan
  • April 8, 2008

2
Introduction
  • Emerging applications have high performance,
    cost, energy demands
  • H.264, wireless, software radio, signal
    processing
  • 10-100 Gops required
  • 200 mW power budget
  • Applications dominated by tight loops processing
    large amounts of streaming data

(Figure: iPhone board)
3
Loop Accelerators
(Figure labels: C Code, Loop, Hardware)
4
Hardware Implementations
  • Customization gets order-of-magnitude performance
    and efficiency wins
  • Viterbi: 100x speedup vs. ARM9

(Figure: design spectrum trading flexibility against efficiency and performance, with labels FPGAs, General-Purpose Processors, DSPs, CGRAs, Multifunction Loop Accelerators, Loop Accelerators, and ASICs)
5
What About Programmability?
  • Software changes: bug fixes, evolving standards
  • dct_8x8() from H.264 reference implementation

(Figure: the coefficient loop of dct_8x8(), shown three times and labeled Version 13.0, Version 13.1, and Version 13.2)

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++)
    {
      i = pos_scan[coeff_ctr][0];
      j = pos_scan[coeff_ctr][1];
      run++;
      ilev = 0;
      if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
      {
        MCcoeff = MC(coeff_ctr);
        runs[MCcoeff]++;
      }
      m7 = &curr_res[block_y + j][block_x];
      level = iabs(m7[i]);
      if (img->AdaptiveRounding)
      {
        fadjust8x8[j][block_x + i] = 0;
      }
      if (level != 0)
      {
        nonzero = TRUE;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
        {
          *coeff_cost += MAX_VALUE;
          img->cofAC[b8 + pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
          img->cofAC[b8 + pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
          ++scan_pos;
          runs[MCcoeff] = -1;
        }
        else
        {
          *coeff_cost += MAX_VALUE;
          ACLevel[scan_pos] = isignab(level, m7[i]);
          ACRun[scan_pos++] = run;
          run = -1;  // reset zero level counter
        }
        level = isignab(level, m7[i]);
        ilev = level;
      }
    }

In the second and third copies, both img->cofAC assignments use pl_off alone as the first index instead of b8 + pl_off; the rest of the loop is the same.
6
Programmable Loop Accelerator
  • Reusable hardware → reduced NRE costs
  • Generalize accelerator without losing efficiency

(Figure: the same design spectrum of flexibility against efficiency and performance, with labels FPGAs, General-Purpose Processors, DSPs, CGRAs, Programmable Loop Accelerators, Multifunction Loop Accelerators, Loop Accelerators, and ASICs)
7
Flexible Accelerators
(Figure labels: Synthesis System, Hardware, Compiler, Loop 2)
  • Generalize accelerator architecture
  • Map new loops to existing hardware

8
Loop Accelerator Architecture
(Figure: loop accelerator block diagram with labels CRF, Point-to-point Connections, FSM, Local Mem, MEM, BR, and Control signals)
  • Hardware realization of a modulo-scheduled loop
  • Parameterized execution resources, storage,
    connectivity

9
Programmable Accelerator Architecture
  • Generalize architectural features that limit
    programmability

(Figure: programmable accelerator block diagram with labels CRF, Literals, Point-to-point Connections, Bus, Control Memory, Local Mem, +/-, MEM, BR, Control signals, and four RR blocks)
  • 50% area overhead vs. non-programmable
    accelerator

10
Mapping Loops onto Hardware
              Processor               Accelerator
FUs           General-purpose         Customized
Storage       Central register file   Distributed registers
Connectivity  Homogeneous             Point-to-point

(Figure: datapath example with labels ALU, ALU, LD, +/-, and CRF and the values 8, 16, 8)
11
Scheduling Example
(Figure: scheduling example with II = 2, placing operations LD1, 2, 3, 4, and 5 over time on ADDER1, ADDER2, and MEM)
12
Modulo Scheduling for LAs
(Figure: flow from Loop through Move Insertion, SMT Scheduling, Register Allocation, and Control Signals, with a Machine description input and an Increment II feedback step)
  • Large search space, few solutions
  • Op-centric approaches unable to find solutions
  • Satisfiability Modulo Theories (SMT) formulation to
    solve linear and SAT constraints simultaneously
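
The retry-at-larger-II flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a small exhaustive search stands in for the SMT solver, move insertion and register allocation are left out, and the names `find_schedule` and `schedule_with_increasing_ii`, the data layout, and the example loop are all hypothetical rather than taken from the presentation.

```python
def find_schedule(ops, edges, fus, ii, horizon):
    """Search for a modulo schedule at a fixed II (stand-in for the SMT step).

    ops:   dict op -> (fu_type, latency)
    edges: list of (src, dst, distance) dependences
    fus:   dict fu_name -> fu_type
    Returns dict op -> (fu_name, time), or None if nothing fits.
    """
    names = list(ops)
    placed = {}

    def fits(op, fu, t):
        # Resource rule: an FU holds at most one op per modulo slot.
        for other_fu, other_t in placed.values():
            if other_fu == fu and other_t % ii == t % ii:
                return False
        # Dependence rule, checked once both endpoints have a time.
        trial = dict(placed)
        trial[op] = (fu, t)
        for src, dst, dist in edges:
            if src in trial and dst in trial:
                if trial[dst][1] < trial[src][1] + ops[src][1] - dist * ii:
                    return False
        return True

    def search(k):
        if k == len(names):
            return True
        op = names[k]
        for t in range(horizon):
            for fu, kind in fus.items():
                if kind == ops[op][0] and fits(op, fu, t):
                    placed[op] = (fu, t)
                    if search(k + 1):
                        return True
                    del placed[op]
        return False

    return dict(placed) if search(0) else None


def schedule_with_increasing_ii(ops, edges, fus, min_ii=1, max_ii=8):
    """Try each II in turn and return (II, schedule) for the first success."""
    for ii in range(min_ii, max_ii + 1):
        schedule = find_schedule(ops, edges, fus, ii, horizon=4 * ii)
        if schedule is not None:
            return ii, schedule
    return None


# Hypothetical loop: load -> add -> add -> store on one MEM unit and one adder.
example_ops = {"ld": ("mem", 2), "add1": ("add", 1),
               "add2": ("add", 1), "st": ("mem", 1)}
example_edges = [("ld", "add1", 0), ("add1", "add2", 0), ("add2", "st", 0)]
example_fus = {"MEM": "mem", "ADDER1": "add"}
print(schedule_with_increasing_ii(example_ops, example_edges, example_fus))
```

In this example II = 1 cannot work because the two memory operations would need the same slot on the single MEM unit, so the loop moves on to II = 2. Exhaustive search is only practical for toy graphs, which is consistent with the slide's point that the search space is large and solutions are few.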

13
SMT Formulation
  • Boolean variables $X_{i,f,t}$ are true if operation i
    is scheduled on FU f at time slot t.
  • Integer variables $S_i$ represent the stage of operation
    i.

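(Figure: dependence edge from operation i to operation j, labeled lat(i) and dist(i,j))

$$\mathrm{sched\_time}(j) \ge \mathrm{sched\_time}(i) + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II$$

$$(X_{i,f_i,t_i} \land X_{j,f_j,t_j}) \rightarrow (S_j \times II + t_j \ge S_i \times II + t_i + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II)$$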
  • More details in paper
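
One way to read the dependence constraint: once the Boolean variables fix the slots t_i and t_j, the inequality S_j × II + t_j ≥ S_i × II + t_i + lat(i) − dist(i,j) × II reduces to a lower bound on the stage difference S_j − S_i. The Python sketch below works that bound out for every slot pair. It assumes slots range over 0 to II − 1, and the function names are illustrative rather than taken from the paper.

```python
def min_stage_gap(ii, lat, dist, t_i, t_j):
    """Smallest S_j - S_i with S_j*II + t_j >= S_i*II + t_i + lat - dist*II."""
    need = t_i + lat - dist * ii - t_j
    # Ceiling division that also behaves correctly for negative values.
    return -(-need // ii)


def stage_gap_table(ii, lat, dist):
    """Map each slot pair (t_i, t_j) to the minimum stage difference."""
    return {
        (t_i, t_j): min_stage_gap(ii, lat, dist, t_i, t_j)
        for t_i in range(ii)
        for t_j in range(ii)
    }


# Producer latency 2, same-iteration dependence (dist = 0), II = 2.
for slots, gap in sorted(stage_gap_table(2, 2, 0).items()):
    print(slots, gap)
```

A loop-carried edge (dist > 0) lowers the bound and can make it negative, which means the consumer is allowed to sit in an earlier stage than its producer.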

14
Measuring Programmability
  • How well can different loops be mapped onto the
    same hardware?
  • Performance matters: how much does II increase?
  • Need set of loops with different degrees of
    similarity

15
Graph Perturbation
  • Synthetically generated graphs
  • More perturbations → less similar to original
    graph
  • Iteratively apply random transformations

  • Add edge between existing operations
  • Add edge with new producer
  • Add edge with new consumer
  • Remove edge
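
The four transformations can be sketched as a small Python routine. The details here are assumptions made for illustration, not the authors' generator: operations are numbered 0 to num_ops − 1, each transformation is picked with equal probability, and an edge between existing operations is skipped if it would close a cycle.

```python
import random


def reaches(edges, start, goal):
    """Return True if goal is reachable from start along directed edges."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(v for u, v in edges if u == node)
    return False


def perturb(num_ops, edges, steps, seed=0):
    """Apply `steps` random transformations; return (num_ops, edges).

    Expects num_ops >= 2 so that two distinct operations can be sampled.
    """
    rng = random.Random(seed)
    edges = set(edges)
    for _ in range(steps):
        kind = rng.choice(("connect", "new_producer", "new_consumer", "remove"))
        if kind == "connect":
            # Add edge between existing operations, keeping the graph acyclic.
            u, v = rng.sample(range(num_ops), 2)
            if not reaches(edges, v, u):
                edges.add((u, v))
        elif kind == "new_producer":
            # New operation feeds an existing one.
            edges.add((num_ops, rng.randrange(num_ops)))
            num_ops += 1
        elif kind == "new_consumer":
            # New operation consumes the result of an existing one.
            edges.add((rng.randrange(num_ops), num_ops))
            num_ops += 1
        elif edges:
            # Remove a randomly chosen edge, if any remain.
            edges.remove(rng.choice(sorted(edges)))
    return num_ops, edges


# Perturb a four-operation chain with an increasing number of steps.
chain = {(0, 1), (1, 2), (2, 3)}
for steps in (1, 5, 20):
    print(steps, perturb(4, chain, steps, seed=1))
```

Fixing the seed makes a perturbed graph reproducible, so the same set of graphs can be rescheduled on different hardware configurations.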
16
Results: Perturbed Graphs
(Figure: average II increase for perturbed graphs, with Base II values 4, 8, 7, 2, 4, 4, 4, 4, 6, 9 and benchmark groups MPEG4, Signal processing, Image, and Math)
17
Results: Restricted Datapath
18
Conclusion
  • Increase flexibility of customized hardware
    without sacrificing performance, efficiency
  • Successfully map loops to heterogeneous hardware
  • Compile times of 5 minutes to 1 hour
  • Software changing faster than hardware →
    patchable ASIC

19
Questions?
20
(No Transcript)
21
Results: Cross Compilation