1
Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators
  • Kevin Fan (1), Manjunath Kudlur (2), Ganesh Dasika, Scott Mahlke

University of Michigan Advanced Computer Architecture Laboratory
(1) Parakinetics, Inc.
(2) Nvidia
2
Introduction
[Figure: MPEG-4 decoder; axis labels: cell-phone battery life (hours), frames/sec]
  • Emerging applications have high performance, cost, and energy demands
    • High-quality video
    • Flash animation
  • Clear need for application- and domain-specific hardware

3
Flexibility
  • Multiple instances of the same application
    • E.g., multiple video codecs: Xvid, DivX, FFmpeg
  • Software algorithms change over time
  • NRE
  • Time-to-market
4
ASIC Alternatives
[Figure: design alternatives; axes: Flexibility; Efficiency, Performance]
  • FPGAs
  • General-purpose processors
  • DSPs
  • Domain-specific accelerators
  • Loop accelerators, ASICs
  • ??? (highly efficient, some programmability)
5
How much post-programmability is really required?
mdct.c in faad2

Version 1.39
    for (k = 0; k < N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
    }

Version 1.40
    for (k = 0; k < N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
        if (b_scale)
            Z1[k][0] = Z1[k][0] * scale;
    }
6
How much post-programmability is really required?
mdct.c in faad2

Version 1.33
    for (k = 0; k < N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[n] = RE(x);
        X_out[N2 - 1 - n] = -IM(x);
        X_out[N2 + n] = IM(x);
        X_out[N - 1 - n] = -RE(x);
    }

Version 1.34
    for (k = 0; k < N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[n] = -RE(x);
        X_out[N2 - 1 - n] = IM(x);
        X_out[N2 + n] = -IM(x);
        X_out[N - 1 - n] = RE(x);
    }
7
How much post-programmability is really required?
H.264 reference implementation

Version 13.0
    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i = pos_scan[coeff_ctr][0];
        j = pos_scan[coeff_ctr][1];
        run++;
        ilev = 0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs(m7[i]);
        if (img->AdaptiveRounding)
            fadjust8x8[j][block_x + i] = 0;
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost += MAX_VALUE;
                img->cofAC[b8 + pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
                img->cofAC[b8 + pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff] = -1;
            } else {
                *coeff_cost += MAX_VALUE;
                ACLevel[scan_pos] = isignab(level, m7[i]);
                ACRun[scan_pos++] = run;
                run = -1; // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }
    }

Version 13.1 and Version 13.2 (as transcribed, the same loop with the two coefficient stores changed)
                img->cofAC[pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
                img->cofAC[pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
  • Mostly minor changes to loops
  • Bug fixes
  • Revisions
  • Possible to design custom HW with minor
    programmability extensions

8
Programmable Loop Accelerator
  • Generalize accelerator without losing efficiency

[Figure: design alternatives; axes: Flexibility; Efficiency, Performance]
  • FPGAs
  • General-purpose processors
  • DSPs
  • Domain-specific accelerators
  • Loop accelerators, ASICs
  • ??? now filled by: Programmable loop accelerators
9
Designing Loop Accelerators
10
Loop Accelerator Architecture
[Figure: loop accelerator datapath; labeled blocks: CRF, point-to-point connections, FSM, local memory, MEM, BR, control signals]
  • Hardware realization of modulo scheduled loop
  • Parameterized execution resources, storage, connectivity
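Because a loop accelerator hardwires one modulo schedule, its execution resources bound how often a new loop iteration can start. As a rough illustration, using a standard modulo-scheduling bound rather than anything shown in the presentation, the resource-constrained minimum initiation interval is the largest per-resource ratio of operations to units, rounded up. The function name res_mii and the resource-class arrays are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Resource-constrained lower bound on the initiation interval (II).
 * ops[i]:   operations per loop iteration that need resource class i
 * units[i]: hardware units of class i in the accelerator (must be > 0)
 * All names here are hypothetical, chosen for illustration only. */
int res_mii(const int *ops, const int *units, size_t n_classes)
{
    int mii = 1; /* the II is never smaller than one cycle */
    for (size_t i = 0; i < n_classes; i++) {
        assert(units[i] > 0);
        int need = (ops[i] + units[i] - 1) / units[i]; /* ceil(ops / units) */
        if (need > mii)
            mii = need;
    }
    return mii;
}
```

For example, a kernel with 4 adds, 2 multiplies, and 1 memory operation running on 2 adders, 1 multiplier, and 1 memory port gives a bound of 2 cycles between iteration starts.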
11
LA Scheduling
[Figure: FIR loop kernel, with annotations:]
  • Mult result has longer lifetime
  • Paths missing
  • No subtract
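For reference, an FIR loop kernel of the kind named on this slide can be written as below. The transcript does not include the loop the presenters used, so the function name, argument names, and integer data type are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* FIR kernel: y[i] = sum over k of coeff[k] * x[i + k].
 * x must hold n_out + n_taps - 1 samples. Names and the int data
 * type are assumptions for this sketch. */
void fir(const int *x, const int *coeff, int *y, size_t n_out, size_t n_taps)
{
    assert(x != NULL && coeff != NULL && y != NULL);
    for (size_t i = 0; i < n_out; i++) {
        int acc = 0;
        /* Inner loop: one multiply and one add per tap. */
        for (size_t k = 0; k < n_taps; k++)
            acc += coeff[k] * x[i + k];
        y[i] = acc;
    }
}
```

A datapath built only for this kernel needs multipliers, adders, and memory ports but no subtractor, which appears to be the kind of mismatch the slide's annotations point to when a different loop is mapped onto it.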
12
LA Datapath Restrictions
[Figure: plot with axis labels Slow-Down and Graph Difference]
13
Programmable Loop-Accelerator Architecture
[Figure: programmable loop accelerator datapath; labeled blocks: CRF, point-to-point connections, control memory, FSM, local memory, function units, MEM, BR, control signals, RR (x4), SRF (x4)]

                 LA                        PLA
Functionality    Custom FU set             Generalized FUs + MOVs
Storage          Limited size, no addr.    Rotating reg. files
Connectivity     Point-to-point            Bus + port-swapping
Control          Hardwired control         Lit. reg. file + control mem
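The rotating register file listed for PLA storage is a scheme commonly paired with modulo scheduling: the physical register behind a logical register number shifts by one at each iteration boundary, so values from overlapping iterations do not overwrite each other. The sketch below shows that address mapping in general terms. The struct layout, the size of 8, and all function names are assumptions, not the PLA's actual design.

```c
#include <assert.h>

#define RRF_SIZE 8 /* hypothetical number of physical registers */

typedef struct {
    int regs[RRF_SIZE];
    unsigned base; /* rotating base, changed once per loop iteration */
} rrf_t;

/* Map a logical register number to a physical index. */
static unsigned rrf_index(const rrf_t *f, unsigned logical)
{
    assert(logical < RRF_SIZE);
    return (f->base + logical) % RRF_SIZE;
}

void rrf_write(rrf_t *f, unsigned logical, int value)
{
    f->regs[rrf_index(f, logical)] = value;
}

int rrf_read(const rrf_t *f, unsigned logical)
{
    return f->regs[rrf_index(f, logical)];
}

/* Rotate at an iteration boundary: a value written to logical
 * register r is visible as logical register r + 1 afterwards. */
void rrf_rotate(rrf_t *f)
{
    f->base = (f->base + RRF_SIZE - 1) % RRF_SIZE;
}
```

With this mapping, each iteration can write its result to the same logical register while the previous iteration's result stays readable one register number higher.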
14
Experimental Setup
  • Wide variety of benchmarks
    • DSP
    • Media
    • Linear Algebra
  • Baseline LAs
    • Used LA synthesis system to generate HDL
    • 200 MHz @ 0.13 µm
  • Comparisons
    • PLAs (200 MHz @ 0.13 µm)
    • OR-1200 (300 MHz @ 0.13 µm)

15
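Area
[Figure: area chart; labeled reference: OR-1200]
16
Power Consumption
[Figure: power consumption chart; labeled reference: OR-1200]
17
Power Consumption
[Figure: power consumption chart; labeled references: OR-1200, OR-1200 equiv]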
18
Power Breakdown
19
Scheduling for PLAs
[Figure: flow diagram; labels: Synthesis System, Hardware, Compiler, SMT-solver, Loop 2]
  • Generalize accelerator architecture
  • Map new loops to existing hardware

20
PLA Scheduling
21
PLA Scheduling
SMT
22
Programmability
Small, with complex communication
Small, with simple communication
23
Power Efficiency
[Figure: scatter plot; axes: Performance (MIPS), Power (mW)]
  • LA: 105 MIPS/mW
  • PLA: 24 MIPS/mW
  • Tensilica Diamond Core: 12 MIPS/mW
  • TI C6x: 5 MIPS/mW
  • ARM11: 3 MIPS/mW
  • OR1K: 2 MIPS/mW
  • Itanium2: 0.08 MIPS/mW
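The MIPS/mW figures on this slide can be compared by dividing one by another. The short sketch below does that arithmetic; the function name is hypothetical, and the results are simple quotients of the slide's numbers, not additional measurements.

```c
#include <assert.h>

/* How many times more power-efficient design a is than design b,
 * given both efficiencies in MIPS/mW. */
double efficiency_ratio(double a_mips_per_mw, double b_mips_per_mw)
{
    assert(b_mips_per_mw > 0.0);
    return a_mips_per_mw / b_mips_per_mw;
}
```

With the slide's values, efficiency_ratio(105.0, 24.0) is 4.375, so the hardwired LA is roughly 4.4 times as efficient as the PLA, and efficiency_ratio(24.0, 2.0) is 12, the PLA's advantage over the OR1K.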
24
Conclusion
  • Programmable loop accelerators retain efficiency while being programmable
  • Loop accelerator datapath generalized in a cost-effective way
  • Significant benefits over GPP
    • 4x-34x improved power efficiency
    • 30x improved area efficiency

25
Questions?
http://cccp.eecs.umich.edu