Title: Modulo Scheduling for Highly Customized Datapaths to Increase Hardware Reusability
1Modulo Scheduling for Highly Customized Datapaths
to Increase Hardware Reusability
- Kevin Fan, Hyunchul Park, Manjunath Kudlur, Scott Mahlke
- Advanced Computer Architecture Laboratory
- University of Michigan
- April 8, 2008
2Introduction
- Emerging applications have high performance, cost, and energy demands
  - H.264, wireless, software radio, signal processing
  - 10-100 Gops required
  - 200 mW power budget
- Applications dominated by tight loops processing large amounts of streaming data
iPhone board
3Loop Accelerators
[Figure: C Code, Loop, Hardware]
4Hardware Implementations
- Customization gets order-of-magnitude performance and efficiency wins
  - Viterbi: 100x speedup vs. ARM9
[Figure: design space. Axes: Flexibility; Efficiency, Performance. Labeled points: General Purpose Processors, DSPs, FPGAs, CGRAs, Multifunction Loop Accelerators, Loop Accelerators/ASICs.]
5What About Programmability?
- Software changes: bug fixes, evolving standards
- dct_8x8() from H.264 reference implementation
for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++)
{
  i = pos_scan[coeff_ctr][0];
  j = pos_scan[coeff_ctr][1];
  run++;
  ilev = 0;
  if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
  {
    MCcoeff = MC(coeff_ctr);
    runs[MCcoeff]++;
  }
  m7 = &curr_res[block_y + j][block_x];
  level = iabs (m7[i]);
  if (img->AdaptiveRounding)
    fadjust8x8[j][block_x + i] = 0;
  if (level != 0)
  {
    nonzero = TRUE;
    if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
    {
      coeff_cost = MAX_VALUE;
      img->cofAC[b8 + pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
      img->cofAC[b8 + pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
      ++scan_pos;
      runs[MCcoeff] = -1;
    }
    else
    {
      coeff_cost = MAX_VALUE;
      ACLevel[scan_pos] = isignab(level, m7[i]);
      ACRun[scan_pos++] = run;
      run = -1; // reset zero level counter
    }
    level = isignab(level, m7[i]);
    ilev = level;
  }
}
for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++)
{
  i = pos_scan[coeff_ctr][0];
  j = pos_scan[coeff_ctr][1];
  run++;
  ilev = 0;
  if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
  {
    MCcoeff = MC(coeff_ctr);
    runs[MCcoeff]++;
  }
  m7 = &curr_res[block_y + j][block_x];
  level = iabs (m7[i]);
  if (img->AdaptiveRounding)
    fadjust8x8[j][block_x + i] = 0;
  if (level != 0)
  {
    nonzero = TRUE;
    if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
    {
      coeff_cost = MAX_VALUE;
      img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
      img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
      ++scan_pos;
      runs[MCcoeff] = -1;
    }
    else
    {
      coeff_cost = MAX_VALUE;
      ACLevel[scan_pos] = isignab(level, m7[i]);
      ACRun[scan_pos++] = run;
      run = -1; // reset zero level counter
    }
    level = isignab(level, m7[i]);
    ilev = level;
  }
}
for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++)
{
  i = pos_scan[coeff_ctr][0];
  j = pos_scan[coeff_ctr][1];
  run++;
  ilev = 0;
  if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
  {
    MCcoeff = MC(coeff_ctr);
    runs[MCcoeff]++;
  }
  m7 = &curr_res[block_y + j][block_x];
  level = iabs (m7[i]);
  if (img->AdaptiveRounding)
    fadjust8x8[j][block_x + i] = 0;
  if (level != 0)
  {
    nonzero = TRUE;
    if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC)
    {
      coeff_cost = MAX_VALUE;
      img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
      img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
      ++scan_pos;
      runs[MCcoeff] = -1;
    }
    else
    {
      coeff_cost = MAX_VALUE;
      ACLevel[scan_pos] = isignab(level, m7[i]);
      ACRun[scan_pos++] = run;
      run = -1; // reset zero level counter
    }
    level = isignab(level, m7[i]);
    ilev = level;
  }
}
Version 13.0
Version 13.1
Version 13.2
6Programmable Loop Accelerator
- Reusable hardware → reduced NRE costs
- Generalize accelerator without losing efficiency
[Figure: design space. Axes: Flexibility; Efficiency, Performance. Labeled points: General Purpose Processors, DSPs, FPGAs, CGRAs, Programmable Loop Accelerators, Multifunction Loop Accelerators, Loop Accelerators/ASICs.]
7Flexible Accelerators
[Figure: Synthesis System, Hardware, Compiler, Loop 2]
- Generalize accelerator architecture
- Map new loops to existing hardware
8Loop Accelerator Architecture
[Figure: loop accelerator datapath. Labels: CRF, Point-to-point Connections, FSM, Local Mem, MEM, BR, Control signals.]
- Hardware realization of modulo scheduled loop
- Parameterized execution resources, storage,
connectivity
9Programmable Accelerator Architecture
- Generalize architectural features that limit
programmability
[Figure: programmable accelerator datapath. Labels: CRF, Literals, Point-to-point Connections, Bus, Control Memory, Local Mem, +/-, MEM, BR, RR (x4), Control signals.]
- 50% area overhead vs. non-programmable accelerator
10Mapping Loops onto Hardware
              Processor               Accelerator
FUs           General-purpose         Customized
Storage       Central register file   Distributed registers
Connectivity  Homogeneous             Point-to-point
[Figure: example datapath with ALU, ALU, LD, and +/- units and a CRF; numeric labels 8, 16, 8.]
11Scheduling Example
[Figure: scheduling example at II=2. Resources: ADDER1, ADDER2, MEM. Axis: Time. Operations shown: LD1, 2, 3, 4, 5.]
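In a modulo schedule each resource is reused every II cycles, so an operation issued at time t occupies row t mod II of the modulo reservation table. The sketch below illustrates only that rule. The function names and the placements are hypothetical and are not taken from the slide's figure.

```python
def mrt_slot(time: int, ii: int) -> int:
    """Row of the modulo reservation table used by an op issued at `time`."""
    return time % ii


def find_conflicts(placements, ii):
    """Return pairs of ops that claim the same (resource, slot) entry.

    `placements` maps an op name to (resource, issue_time).
    """
    seen = {}
    conflicts = []
    for op, (resource, time) in placements.items():
        key = (resource, mrt_slot(time, ii))
        if key in seen:
            conflicts.append((seen[key], op))
        else:
            seen[key] = op
    return conflicts


# Hypothetical placements, chosen only to show one collision.
placements = {
    "LD1": ("MEM", 0),
    "add_a": ("ADDER1", 2),
    "add_b": ("ADDER1", 4),  # 4 % 2 == 2 % 2, so it collides with add_a at II=2
    "add_c": ("ADDER2", 3),
}
print(find_conflicts(placements, ii=2))
```

With these placements the two ADDER1 operations collide at II=2 but not at II=3, which is why a failed placement can force the scheduler to try a larger II.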
12Modulo Scheduling for LAs
[Figure: compilation flow. Labels: Loop, Machine description, Move Insertion, SMT Scheduling, Register Allocation, Control Signals, Increment II.]
- Large search space, few solutions
- Op-centric approaches unable to find solutions
- Satisfiability Modulo Theories (SMT) formulation to solve linear and SAT constraints simultaneously
13SMT Formulation
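- Boolean variables $X_{i,f,t}$ are true if operation i is scheduled on FU f at time slot t.
- Integer variables $S_i$ represent the stage of operation i.
[Figure: dependence edge from operation i to operation j, labeled lat(i) and dist(i,j)]
\[ \mathrm{sched\_time}(j) \ge \mathrm{sched\_time}(i) + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II \]
\[ (X_{i,f_i,t_i} \land X_{j,f_j,t_j}) \rightarrow (S_j \times II + t_j \ge S_i \times II + t_i + \mathrm{lat}(i) - \mathrm{dist}(i,j) \times II) \]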
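To make the formulation concrete, here is a minimal Python sketch that checks the same two kinds of constraint (one operation per FU and time slot, plus the dependence inequality above) on a tiny loop. It uses brute-force enumeration in place of an SMT solver, so it is not the authors' implementation and only scales to a handful of operations. The op names, the machine description, and the bound on the number of stages are assumptions made for illustration.

```python
from itertools import product


def schedule(ops, edges, fus, ii):
    """Try to place every op at initiation interval `ii`.

    ops:   {op: kind}
    edges: [(i, j, lat, dist)] dependence edges from op i to op j
    fus:   {fu: set of op kinds that FU supports}
    Returns {op: (fu, slot, stage)} or None if nothing fits.
    """
    names = list(ops)
    max_stage = len(names)  # assumed bound on stages, fine for tiny loops
    choices = [
        [(f, t, s) for f in fus if ops[op] in fus[f]
         for t in range(ii) for s in range(max_stage)]
        for op in names
    ]
    for combo in product(*choices):
        # Resource constraint: at most one op per (FU, time slot).
        if len({(f, t) for f, t, _ in combo}) != len(combo):
            continue
        assign = dict(zip(names, combo))
        # Dependence constraint: S_j*II + t_j >= S_i*II + t_i + lat - dist*II.
        if all(assign[j][2] * ii + assign[j][1]
               >= assign[i][2] * ii + assign[i][1] + lat - dist * ii
               for i, j, lat, dist in edges):
            return assign
    return None


def modulo_schedule(ops, edges, fus, max_ii=8):
    """Increment II until a schedule is found; returns (ii, assignment) or None."""
    for ii in range(1, max_ii + 1):
        result = schedule(ops, edges, fus, ii)
        if result is not None:
            return ii, result
    return None


# Hypothetical loop: load, two dependent adds with a recurrence, store.
ops = {"ld": "mem", "a1": "add", "a2": "add", "st": "mem"}
fus = {"MEM": {"mem"}, "ADDER1": {"add"}}
edges = [("ld", "a1", 2, 0), ("a1", "a2", 1, 0),
         ("a2", "st", 1, 0), ("a2", "a1", 1, 1)]
ii, assignment = modulo_schedule(ops, edges, fus)
print("II =", ii)
```

Each entry `assignment[op] = (fu, slot, stage)` plays the role of a true $X_{i,f,t}$ together with its $S_i$. A real solver explores the same space symbolically rather than by enumeration.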
14Measuring Programmability
- How well can different loops be mapped onto the same hardware?
- Performance matters: how much does II increase?
- Need set of loops with different degrees of similarity
15Graph Perturbation
- Synthetically generated graphs
- More perturbations → less similar to original graph
- Iteratively apply random transformations
  - Add edge between existing operations
  - Add edge with new producer
  - Add edge with new consumer
  - Remove edge
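The four transformations listed above can be sketched as follows. This is a simplified illustration under stated assumptions, not the generator used for the results: operations are plain integers, edges carry no latency or distance, each transformation is picked with equal probability, and no attempt is made to keep the graph acyclic. The name `perturb` and its parameters are hypothetical.

```python
import random


def perturb(edges, num_ops, steps, seed=0):
    """Apply `steps` random transformations to a dataflow graph.

    edges:   iterable of (producer, consumer) pairs over ops 0..num_ops-1
    Returns (new_edges, new_num_ops); the input is not modified.
    """
    rng = random.Random(seed)
    edges = set(edges)
    kinds = ["add_edge", "new_producer", "new_consumer", "remove_edge"]
    for _ in range(steps):
        kind = rng.choice(kinds)
        if kind == "add_edge" and num_ops >= 2:
            # Add edge between two distinct existing operations.
            src, dst = rng.sample(range(num_ops), 2)
            edges.add((src, dst))
        elif kind == "new_producer" and num_ops >= 1:
            # New operation feeds an existing one.
            edges.add((num_ops, rng.randrange(num_ops)))
            num_ops += 1
        elif kind == "new_consumer" and num_ops >= 1:
            # New operation consumes the result of an existing one.
            edges.add((rng.randrange(num_ops), num_ops))
            num_ops += 1
        elif kind == "remove_edge" and edges:
            edges.remove(rng.choice(sorted(edges)))
    return edges, num_ops


# Hypothetical original graph: a chain of four operations.
base = {(0, 1), (1, 2), (2, 3)}
for steps in (0, 5, 20):
    new_edges, new_ops = perturb(base, 4, steps, seed=1)
    print(steps, new_ops, len(new_edges))
```

Increasing `steps` yields graphs that drift further from the original, which gives a set of loops with graded similarity to map onto the same hardware.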
16Results: Perturbed Graphs
[Figure: average II increase for perturbed graphs. Benchmark groups: MPEG4, Signal processing, Image, Math. Base II values shown: 4, 8, 7, 2, 4, 4, 4, 4, 6, 9.]
17Results: Restricted Datapath
18Conclusion
- Increase flexibility of customized hardware without sacrificing performance, efficiency
- Successfully map loops to heterogeneous hardware
- Compile times of 5 minutes to 1 hour
- Software changing faster than hardware → patchable ASIC
19Questions?
21Results: Cross Compilation