Title: Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators
1. Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators
- Kevin Fan (1), Manjunath Kudlur (2), Ganesh Dasika, Scott Mahlke
- University of Michigan Advanced Computer Architecture Laboratory
- (1) Parakinetics, Inc.
- (2) Nvidia
2. Introduction
[Figure: MPEG-4 decoder; axes: cell-phone battery life (hours) and frames/sec]
- Emerging applications have high performance, cost, and energy demands
  - High-quality video
  - Flash animation
- Clear need for application- and domain-specific hardware
3. Flexibility
- Multiple instances of the same application
  - E.g., multiple video codecs (Xvid, DivX, FFMpeg)
- Software algorithms change over time
- NRE
- Time-to-market
4. ASIC Alternatives
[Figure: flexibility versus efficiency and performance. Labels: general-purpose processors, DSPs, domain-specific accelerators, FPGAs, loop accelerators and ASICs. An open position marked "???" is captioned "Highly efficient, some programmability"]
5. How much post-programmability is really required?
mdct.c in faad2

Version 1.39:

    for (k = 0; k < N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
    }

Version 1.40:

    for (k = 0; k < N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
        if (b_scale)
            Z1[k][0] = Z1[k][0] * scale;
    }
6. How much post-programmability is really required?
mdct.c in faad2

Version 1.33:

    for (k = 0; k < N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[n] = RE(x);
        X_out[N2 - 1 - n] = -IM(x);
        X_out[N2 + n] = IM(x);
        X_out[N - 1 - n] = -RE(x);
    }

Version 1.34:

    for (k = 0; k < N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[n] = -RE(x);
        X_out[N2 - 1 - n] = IM(x);
        X_out[N2 + n] = -IM(x);
        X_out[N - 1 - n] = RE(x);
    }
7. How much post-programmability is really required?
H.264 reference implementation

Version 13.0:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i = pos_scan[coeff_ctr][0];
        j = pos_scan[coeff_ctr][1];
        run++;
        ilev = 0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs(m7[i]);
        if (img->AdaptiveRounding)
            fadjust8x8[j][block_x + i] = 0;
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost = MAX_VALUE;
                img->cofAC[b8 + pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
                img->cofAC[b8 + pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff] = -1;
            } else {
                *coeff_cost = MAX_VALUE;
                ACLevel[scan_pos] = isignab(level, m7[i]);
                ACRun[scan_pos++] = run;
                run = -1;  // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }
    }

Version 13.1:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i = pos_scan[coeff_ctr][0];
        j = pos_scan[coeff_ctr][1];
        run++;
        ilev = 0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs(m7[i]);
        if (img->AdaptiveRounding)
            fadjust8x8[j][block_x + i] = 0;
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost = MAX_VALUE;
                img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
                img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff] = -1;
            } else {
                *coeff_cost = MAX_VALUE;
                ACLevel[scan_pos] = isignab(level, m7[i]);
                ACRun[scan_pos++] = run;
                run = -1;  // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }
    }

Version 13.2:

    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i = pos_scan[coeff_ctr][0];
        j = pos_scan[coeff_ctr][1];
        run++;
        ilev = 0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs(m7[i]);
        if (img->AdaptiveRounding)
            fadjust8x8[j][block_x + i] = 0;
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost = MAX_VALUE;
                img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
                img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff] = -1;
            } else {
                *coeff_cost = MAX_VALUE;
                ACLevel[scan_pos] = isignab(level, m7[i]);
                ACRun[scan_pos++] = run;
                run = -1;  // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }
    }

- Mostly minor changes to loops
  - Bug fixes
  - Revisions
- Possible to design custom HW with minor programmability extensions
8. Programmable Loop Accelerator
- Generalize accelerator without losing efficiency
[Figure: flexibility versus efficiency and performance. Labels: general-purpose processors, DSPs, domain-specific accelerators, FPGAs, loop accelerators and ASICs, and programmable loop accelerators at the position marked "???"]
9. Designing Loop Accelerators
10. Loop Accelerator Architecture
[Figure: loop accelerator datapath. Labels: FSM, control signals, CRF, local mem, MEM, BR, point-to-point connections]
- Hardware realization of a modulo-scheduled loop
- Parameterized execution resources, storage, connectivity
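Modulo scheduling starts a new loop iteration every II cycles (the initiation interval), so each function unit offers II issue slots per iteration. A minimal sketch of the standard resource-constrained lower bound on II follows, assuming fully pipelined units. The function name res_mii and the per-class arrays are illustrative and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Resource-constrained lower bound on the initiation interval (II).
 *   ops[c]   = number of loop operations that need FU class c
 *   units[c] = number of FUs of class c in the datapath
 * Returns 0 when a needed class has no units (the loop cannot be mapped).
 * Assumes fully pipelined units; recurrence constraints are ignored.
 */
unsigned res_mii(const unsigned *ops, const unsigned *units, size_t classes)
{
    assert(ops != NULL && units != NULL);
    unsigned mii = 1;
    for (size_t c = 0; c < classes; c++) {
        if (ops[c] == 0)
            continue;                 /* class not used by this loop */
        if (units[c] == 0)
            return 0;                 /* required FU is missing */
        unsigned need = (ops[c] + units[c] - 1) / units[c];  /* ceil(ops/units) */
        if (need > mii)
            mii = need;
    }
    return mii;
}
```

For example, a loop with 4 adds, 2 multiplies, and 1 memory access on a datapath with 2 adders, 1 multiplier, and 1 memory port gives a bound of 2. Read the other way, the same relation gives the fewest units of a class needed for a target II: ceil(ops / II).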
11. LA Scheduling
[Figure: FIR loop kernel on the LA datapath; outcome marked "????". Annotations:]
- Mult result has longer lifetime
- Paths missing
- No subtract
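For reference, an FIR loop kernel of the kind named on this slide can be sketched in C as below. The function name, argument layout, and integer types are assumptions made for illustration; the slide does not show the kernel's source.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical FIR filter: y[i] = sum over j < taps of coeff[j] * x[i + j].
 * x must hold n + taps - 1 samples; y must hold n results.
 */
void fir(const int *x, const int *coeff, int *y, size_t n, size_t taps)
{
    assert(x != NULL && coeff != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        int acc = 0;
        for (size_t j = 0; j < taps; j++)
            acc += coeff[j] * x[i + j];   /* multiply-accumulate */
        y[i] = acc;
    }
}
```

The loop body uses only loads, a multiply, and an add. A datapath specialized for it would plausibly contain no subtractor, which is one way to read the "No subtract" annotation when a different loop is scheduled on the same hardware.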
12. LA Datapath Restrictions
[Chart: slow-down versus graph difference]
13. Programmable Loop-Accelerator Architecture
[Figure: PLA datapath. Labels: FSM, control memory, control signals, CRF, SRF (x4), RR (x4), local mem, MEM, BR, point-to-point connections]

                 LA                       PLA
Functionality    Custom FU set            Generalized FUs, MOVs
Storage          Limited size, no addr.   Rotating reg. files
Connectivity     Point-to-point           Bus, port-swapping
Control          Hardwired control        Lit. reg. file, control mem
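The rotating register files listed for the PLA can be illustrated with the conventional rotating-register scheme, in which a base pointer moves once per loop iteration so that a value written as register r is seen as register r + 1 one iteration later. This is a sketch of that general technique only; the file size, type names, and function names are assumptions, and the slides do not describe the PLA's actual register-file organization.

```c
#include <assert.h>
#include <stdint.h>

#define RRF_SIZE 8u   /* assumed size, for illustration only */

typedef struct {
    int32_t  regs[RRF_SIZE];
    unsigned base;            /* rotating register base */
} rrf_t;

/* Map a logical register number to a physical slot. */
static unsigned rrf_phys(const rrf_t *f, unsigned logical)
{
    assert(logical < RRF_SIZE);
    return (f->base + logical) % RRF_SIZE;
}

void rrf_write(rrf_t *f, unsigned logical, int32_t value)
{
    f->regs[rrf_phys(f, logical)] = value;
}

int32_t rrf_read(const rrf_t *f, unsigned logical)
{
    return f->regs[rrf_phys(f, logical)];
}

/* Call once per loop iteration: logical r of the previous
 * iteration becomes logical r + 1 in the current one. */
void rrf_rotate(rrf_t *f)
{
    f->base = (f->base + RRF_SIZE - 1) % RRF_SIZE;
}
```

With rotation, every iteration of a modulo-scheduled loop can write the same logical register without overwriting a value that an earlier, still in-flight iteration has yet to read.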
14. Experimental Setup
- Wide variety of benchmarks
  - DSP
  - Media
  - Linear algebra
- Baseline LAs
  - Used LA synthesis system to generate HDL
  - 200 MHz @ 0.13 µm
- Comparisons
  - PLAs (200 MHz @ 0.13 µm)
  - OR-1200 (300 MHz @ 0.13 µm)
15. Area
[Chart: labels include OR-1200]
16. Power Consumption
[Chart: labels include OR-1200]
17. Power Consumption
[Chart: labels include OR-1200 and "OR-1200 equiv"]
18. Power Breakdown
19. Scheduling for PLAs
[Figure: flow diagram. Labels: synthesis system, hardware, compiler with SMT solver, Loop 2]
- Generalize accelerator architecture
- Map new loops to existing hardware
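Mapping a new loop onto existing hardware is a constraint-satisfaction problem, which the slide assigns to a compiler with an SMT solver. The sketch below swaps in a plain backtracking search and models only two constraints: each operation must go to a function unit that supports its opcode, and each unit has II slots per iteration. Dependences, register lifetimes, and connectivity are left out, and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FUS 8

enum { OP_ADD, OP_MUL, OP_SUB };   /* illustrative opcodes, each < 32 */

/* Place ops[i..n) on FUs. supports[f] is a bitmask of the opcodes
 * FU f can execute; load[f] counts ops already placed on FU f. */
static int place(const unsigned *ops, size_t n, size_t i,
                 const unsigned *supports, size_t fus,
                 unsigned ii, unsigned *load)
{
    if (i == n)
        return 1;                              /* every op has a slot */
    for (size_t f = 0; f < fus; f++) {
        if (((supports[f] >> ops[i]) & 1u) && load[f] < ii) {
            load[f]++;
            if (place(ops, n, i + 1, supports, fus, ii, load))
                return 1;
            load[f]--;                         /* backtrack */
        }
    }
    return 0;
}

/* Smallest II in [1, max_ii] at which all ops fit, or 0 if none does. */
unsigned min_feasible_ii(const unsigned *ops, size_t n,
                         const unsigned *supports, size_t fus,
                         unsigned max_ii)
{
    assert(fus <= MAX_FUS);
    for (unsigned ii = 1; ii <= max_ii; ii++) {
        unsigned load[MAX_FUS] = {0};
        if (place(ops, n, 0, supports, fus, ii, load))
            return ii;
    }
    return 0;
}
```

A result of 0 corresponds to a loop that the fixed datapath cannot run at all, for example one that needs a subtract when no unit supports it. A result above the loop's original II corresponds to a slow-down.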
20. PLA Scheduling
21. PLA Scheduling
[Figure label: SMT]
22. Programmability
[Chart annotations: "Small, with complex communication"; "Small, with simple communication"]
23. Power Efficiency
[Chart: performance (MIPS) versus power (mW)]

Design                    Efficiency (MIPS/mW)
LA                        105
PLA                       24
Tensilica Diamond Core    12
TI C6x                    5
ARM11                     3
OR1K                      2
Itanium2                  0.08
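The relative efficiencies implied by these figures can be worked out directly. The sketch below hard-codes the MIPS/mW values from the slide; the struct and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct design {
    const char *name;
    double      mips_per_mw;
};

/* Efficiency figures as listed on the slide. */
static const struct design designs[] = {
    {"LA", 105.0}, {"PLA", 24.0}, {"Tensilica Diamond Core", 12.0},
    {"TI C6x", 5.0}, {"ARM11", 3.0}, {"OR1K", 2.0}, {"Itanium2", 0.08},
};

/* Returns the efficiency of a design, or -1.0 if the name is unknown. */
double efficiency(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof designs / sizeof designs[0]; i++)
        if (strcmp(designs[i].name, name) == 0)
            return designs[i].mips_per_mw;
    return -1.0;
}

/* How many times more power-efficient design a is than design b.
 * Both names must be present in the table. */
double efficiency_ratio(const char *a, const char *b)
{
    double ea = efficiency(a);
    double eb = efficiency(b);
    assert(ea > 0.0 && eb > 0.0);
    return ea / eb;
}
```

With the slide's figures, the PLA comes out 12x as efficient as the OR1K (24 / 2), and the hardwired LA about 4.4x as efficient as the PLA (105 / 24), which gives a rough measure of what the added programmability costs.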
24. Conclusion
- Programmable loop accelerators retain efficiency while being programmable
- Loop accelerator datapath generalized in a cost-effective way
- Significant benefits over general-purpose processors (GPP)
  - 4x-34x improved power efficiency
  - 30x improved area efficiency
25. Questions?
http://cccp.eecs.umich.edu