Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability

Slides: 22

Provided by: fank

Learn more at: https://cccp.eecs.umich.edu


1
Modulo Scheduling for Highly Customized Datapaths
to Increase Hardware Reusability

Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan
April 8, 2008

2
Introduction

Emerging applications have high performance, cost, and energy demands
H.264, wireless, software radio, signal processing
10-100 Gops required
200 mW power budget
Applications are dominated by tight loops processing large amounts of streaming data

iPhone board
3
Loop Accelerators
[Diagram: C Code, Loop, Hardware]
4
Hardware Implementations

Customization gets order-of-magnitude performance and efficiency wins
Viterbi: 100x speedup vs. ARM9

[Figure: design-space plot with axes "Flexibility" and "Efficiency, Performance"; points: FPGAs, General Purpose Processors, DSPs, CGRAs, Multifunction Loop Accelerators, Loop Accelerators/ASICs]
5
What About Programmability?

Software changes: bug fixes, evolving standards
dct_8x8() from H.264 reference implementation

for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++)
{
  i = pos_scan[coeff_ctr][0];
  j = pos_scan[coeff_ctr][1];
  run++;
  ilev = 0;
  if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
  {
    MCcoeff = MC(coeff_ctr);
    runs[MCcoeff]++;
  }
  m7 = &curr_res[block_y + j][block_x];
  level = iabs(m7[i]);
  if (img->AdaptiveRounding)
    fadjust8x8[j][block_x + i] = 0;
  if (level != 0)
  {
    nonzero = TRUE;
    if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
    {
      coeff_cost += MAX_VALUE;
      img->cofAC[b8 + pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
      img->cofAC[b8 + pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
      ++scan_pos;
      runs[MCcoeff] = -1;
    }
    else
    {
      coeff_cost += MAX_VALUE;
      ACLevel[scan_pos] = isignab(level, m7[i]);
      ACRun[scan_pos++] = run;
      run = -1;  // reset zero level counter
    }
    level = isignab(level, m7[i]);
    ilev = level;
  }
}

(Shown three times on the slide; in the second and third copies img->cofAC is indexed by [pl_off] rather than [b8 + pl_off].)
Version 13.0
Version 13.1
Version 13.2
6
Programmable Loop Accelerator

Reusable hardware → reduced NRE costs
Generalize accelerator without losing efficiency

[Figure: design-space plot with axes "Flexibility" and "Efficiency, Performance"; points: FPGAs, General Purpose Processors, DSPs, CGRAs, Programmable Loop Accelerators, Multifunction Loop Accelerators, Loop Accelerators/ASICs]
7
Flexible Accelerators
[Diagram: Synthesis System, Hardware, Compiler, Loop 2]

Generalize accelerator architecture
Map new loops to existing hardware

8
Loop Accelerator Architecture
[Diagram labels: CRF, Point-to-point Connections, FSM, Local Mem, MEM, BR, Control signals]

Hardware realization of modulo scheduled loop
Parameterized execution resources, storage, connectivity

9
Programmable Accelerator Architecture

Generalize architectural features that limit programmability

[Diagram labels: CRF, Literals, Point-to-point Connections, Bus, Control Memory, Local Mem, MEM, BR, Control signals, RR (x4)]

50% area overhead vs. non-programmable accelerator

10
Mapping Loops onto Hardware
              Processor               Accelerator
FUs           General-purpose         Customized
Storage       Central register file   Distributed registers
Connectivity  Homogeneous             Point-to-point

[Diagram labels: ALU, ALU, LD, +/-, CRF; values 8, 16, 8]
11
Scheduling Example
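[Figure: modulo schedule with II = 2 on resources ADDER1, ADDER2, and MEM, plotted against time; operations LD1, 2, 3, 4, and 5 are placed across overlapping iterations, with one placement marked "?"]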
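The resource rule behind an example like this can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' tool: the function name `build_mrt` and the placements below are hypothetical and do not reproduce the slide's figure. The sketch assumes each operation occupies one functional unit for one cycle, so two operations conflict when they land on the same unit in the same slot modulo II.

```python
from typing import Dict, Optional, Tuple


def build_mrt(
    schedule: Dict[str, Tuple[str, int]], ii: int
) -> Optional[Dict[Tuple[str, int], str]]:
    """Build a modulo reservation table, or return None on a conflict.

    schedule maps an operation name to (functional unit, absolute start time).
    A new iteration starts every `ii` cycles, so a unit is busy in slot
    `time % ii` for every iteration, not just the first one.
    """
    mrt: Dict[Tuple[str, int], str] = {}
    for op, (fu, time) in schedule.items():
        key = (fu, time % ii)
        if key in mrt:
            return None  # another operation already owns this unit and slot
        mrt[key] = op
    return mrt


# Hypothetical placements (assumed for illustration only).
fits = {
    "LD1": ("MEM", 0),
    "op2": ("ADDER1", 1),
    "op3": ("ADDER2", 1),
    "op4": ("ADDER1", 2),  # slot 2 % 2 == 0 on ADDER1: free
    "op5": ("ADDER2", 2),
}
clashes = dict(fits, op4=("ADDER1", 3))  # slot 3 % 2 == 1: taken by op2

print(build_mrt(fits, ii=2) is not None)
print(build_mrt(clashes, ii=2) is None)
```

Moving `op4` by a single cycle is enough to break the schedule, which is why placements that look free in absolute time can still be illegal at a given II.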
12
Modulo Scheduling for LAs
[Flow diagram: Loop → Move Insertion → SMT Scheduling → Register Allocation → Control Signals; also labeled: Machine description, Increment II]

Large search space, few solutions
Op-centric approaches unable to find solutions
Satisfiability Modulo Theory (SMT) formulation to solve linear and SAT constraints simultaneously
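The "Increment II" step in this flow suggests an outer loop that retries scheduling at successively larger initiation intervals. A minimal sketch of such a driver follows. It assumes a caller-supplied `try_schedule(ii)` callback (a hypothetical name) that returns a schedule or `None`; how the presenters' tool chooses the starting II or the upper bound is not stated in the slides.

```python
from typing import Callable, Optional, Tuple, TypeVar

S = TypeVar("S")


def schedule_with_increasing_ii(
    try_schedule: Callable[[int], Optional[S]],
    min_ii: int,
    max_ii: int,
) -> Optional[Tuple[int, S]]:
    """Return (ii, schedule) for the smallest II that schedules, else None."""
    for ii in range(min_ii, max_ii + 1):
        schedule = try_schedule(ii)
        if schedule is not None:
            return ii, schedule
    return None


# Toy stand-in for a real scheduler: pretend the loop only fits at II >= 3.
result = schedule_with_increasing_ii(
    lambda ii: "ok" if ii >= 3 else None, min_ii=1, max_ii=8
)
print(result)
```

Because the search starts at the smallest candidate, the first success is also the best-performing schedule within the tried range.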

13
SMT Formulation

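Boolean variables $X_{i,f,t}$ are true if operation i is scheduled on FU f at time slot t.
Integer variables $S_i$ represent the stage of operation i.

For a dependence edge from i to j with latency lat(i) and distance dist(i,j):
$\mathrm{sched\_time}(j) \ge \mathrm{sched\_time}(i) + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II$
$(X_{i,f_i,t_i} \land X_{j,f_j,t_j}) \rightarrow (S_j \times II + t_j \ge S_i \times II + t_i + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II)$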

More details in paper
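To make the constraints concrete, here is a small Python sketch that enumerates assignments of (FU, time slot, stage) to each operation and keeps the first one satisfying them. It is an exhaustive search standing in for an SMT solver, not the formulation from the paper. It assumes the dependence constraint reads "stage(j) * II + slot(j) >= stage(i) * II + slot(i) + lat(i) - dist(i,j) * II", that each FU holds one operation per modulo slot, and it ignores connectivity and register constraints. All names and the toy loop are hypothetical.

```python
from itertools import product


def find_schedule(ops, fus, edges, ii, max_stage=2):
    """Search for a placement satisfying resource and dependence constraints.

    ops:   {name: {"kind": str, "lat": int}}
    fus:   {fu_name: set of operation kinds that FU can execute}
    edges: [(producer, consumer, distance)]
    Returns {name: (fu, slot, stage)} or None if nothing fits at this II.
    """
    names = list(ops)
    # Candidate (fu, slot, stage) triples per operation, compatible FUs only.
    choices = [
        [(fu, t, s)
         for fu, kinds in fus.items() if ops[n]["kind"] in kinds
         for t in range(ii)
         for s in range(max_stage + 1)]
        for n in names
    ]
    for combo in product(*choices):
        slots = [(fu, t) for fu, t, _ in combo]
        if len(set(slots)) != len(slots):
            continue  # two operations share an FU in the same modulo slot
        place = dict(zip(names, combo))

        def time(n):
            _, t, s = place[n]
            return s * ii + t

        if all(time(j) >= time(i) + ops[i]["lat"] - d * ii
               for i, j, d in edges):
            return place
    return None


# Toy loop (assumed): a load feeding two chained adds with a recurrence.
ops = {
    "ld": {"kind": "load", "lat": 2},
    "a1": {"kind": "add", "lat": 1},
    "a2": {"kind": "add", "lat": 1},
}
fus = {"MEM": {"load"}, "ADDER1": {"add"}}
edges = [("ld", "a1", 0), ("a1", "a2", 0), ("a2", "a1", 1)]

print(find_schedule(ops, fus, edges, ii=1))  # one adder cannot hold two adds
print(find_schedule(ops, fus, edges, ii=2))
```

The search space grows as the product of every operation's choices, which illustrates why the slides describe a large space with few solutions and turn to a solver rather than enumeration.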

14
Measuring Programmability

How well can different loops be mapped onto the same hardware?
Performance matters: how much does II increase?
Need a set of loops with different degrees of similarity
15
Graph Perturbation

Synthetically generated graphs
More perturbations → less similar to original graph
Iteratively apply random transformations

Add edge between existing operations
Add edge with new producer
Add edge with new consumer
Remove edge
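The four transformations listed above can be sketched as a small random graph editor. This is an illustrative assumption about how such perturbation might be coded, not the presenters' generator: operations are plain integers, edges are (producer, consumer) pairs, and the sketch does not model operation types, edge distances, or checks that the result stays schedulable.

```python
import random


def perturb(nodes, edges, steps, rng):
    """Apply `steps` random transformations to a dataflow graph.

    nodes: iterable of integer operation ids
    edges: iterable of (producer, consumer) pairs
    rng:   a random.Random instance, so runs are reproducible
    Returns (nodes, edges) as a new list and a new set.
    """
    nodes = list(nodes)
    edges = set(edges)
    kinds = ["add_edge", "new_producer", "new_consumer", "remove_edge"]
    for _ in range(steps):
        kind = rng.choice(kinds)
        if kind == "add_edge" and len(nodes) >= 2:
            # Add edge between existing operations.
            src, dst = rng.sample(nodes, 2)
            edges.add((src, dst))
        elif kind == "new_producer" and nodes:
            # Add edge with new producer feeding an existing operation.
            new = max(nodes) + 1
            edges.add((new, rng.choice(nodes)))
            nodes.append(new)
        elif kind == "new_consumer" and nodes:
            # Add edge with new consumer reading an existing operation.
            new = max(nodes) + 1
            edges.add((rng.choice(nodes), new))
            nodes.append(new)
        elif kind == "remove_edge" and edges:
            # Remove edge (sorted so the choice depends only on the seed).
            edges.remove(rng.choice(sorted(edges)))
    return nodes, edges


base_nodes, base_edges = [0, 1, 2], {(0, 1), (1, 2)}
new_nodes, new_edges = perturb(base_nodes, base_edges, 5, random.Random(0))
print(len(new_nodes), len(new_edges))
```

Raising `steps` yields graphs that drift further from the original, matching the idea that more perturbations give less similar loops.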
16
Results: Perturbed Graphs
[Chart: average II increase per benchmark, with benchmarks grouped as MPEG4, Signal processing, Image, and Math; base II values shown: 4, 8, 7, 2, 4, 4, 4, 4, 6, 9]
17
Results: Restricted Datapath
18
Conclusion

Increase flexibility of customized hardware without sacrificing performance or efficiency
Successfully map loops to heterogeneous hardware
Compile times of 5 minutes to 1 hour
Software changing faster than hardware → patchable ASIC

19
Questions?
21
Results: Cross Compilation
