Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators

Slides: 28

Provided by: cccpEec

Learn more at: https://cccp.eecs.umich.edu

Transcript and Presenter's Notes

1
Bridging the Computation Gap Between Programmable Processors and Hardwired Accelerators

Kevin Fan¹, Manjunath Kudlur², Ganesh Dasika, Scott Mahlke

University of Michigan, Advanced Computer Architecture Laboratory
¹Parakinetics, Inc.
²Nvidia
2
Introduction
[Figure: MPEG-4 decoder; axes: cell-phone battery life (hours), frames/sec]

Emerging applications have high performance, cost, and energy demands
High-quality video
Flash animation
Clear need for application- and domain-specific hardware

3
Flexibility

Multiple instances of the same application
E.g., multiple video codecs
Software algorithms change over time
NRE
Time-to-market

Xvid
DivX
FFMpeg
4
ASIC Alternatives
FPGAs
Highly efficient, some programmability
General-Purpose Processors
DSPs
Domain-specific accelerators
Flexibility
???
Loop Accelerators, ASICs
Efficiency, Performance
5
How much post-programmability is really required?
mdct.c in faad2
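Version 1.39:
    for (k = 0; k < N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
    }
Version 1.40:
    for (k = 0; k < N4; k++) {
        ...
        real = Z1[k][0];
        img = Z1[k][1];
        Z1[k][0] = real * sincos[k][0] - img * sincos[k][1];
        Z1[k][0] = Z1[k][0] << 1;
        if (b_scale)
            Z1[k][0] = Z1[k][0] * scale;
    }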
6
How much post-programmability is really required?
mdct.c in faad2
Version 1.33:
    for (k = 0; k < N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[n] = RE(x);
        X_out[N2 - 1 - n] = -IM(x);
        X_out[N2 + n] = IM(x);
        X_out[N - 1 - n] = -RE(x);
    }
Version 1.34:
    for (k = 0; k < N4; k++) {
        ...
        uint16_t n = k << 1;
        ComplexMult(...);
        X_out[n] = -RE(x);
        X_out[N2 - 1 - n] = IM(x);
        X_out[N2 + n] = -IM(x);
        X_out[N - 1 - n] = RE(x);
    }
7
How much post-programmability is really required?
H.264 reference implementation
Versions 13.0, 13.1, and 13.2 (the same loop shown three times with near-identical bodies; 13.1 and 13.2 index img->cofAC without b8):
    for (coeff_ctr = 0; coeff_ctr < 64; coeff_ctr++) {
        i = pos_scan[coeff_ctr][0];
        j = pos_scan[coeff_ctr][1];
        run++;
        ilev = 0;
        if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
            MCcoeff = MC(coeff_ctr);
            runs[MCcoeff]++;
        }
        m7 = &curr_res[block_y + j][block_x];
        level = iabs(m7[i]);
        if (img->AdaptiveRounding)
            fadjust8x8[j][block_x + i] = 0;
        if (level != 0) {
            nonzero = TRUE;
            if (currMB->luma_transform_size_8x8_flag && input->symbol_mode == CAVLC) {
                *coeff_cost += MAX_VALUE;
                img->cofAC[b8 + pl_off][MCcoeff][0][scan_poss[MCcoeff]] = isignab(level, m7[i]);
                img->cofAC[b8 + pl_off][MCcoeff][1][scan_poss[MCcoeff]++] = runs[MCcoeff];
                ++scan_pos;
                runs[MCcoeff] = -1;
            } else {
                *coeff_cost += MAX_VALUE;
                ACLevel[scan_pos] = isignab(level, m7[i]);
                ACRun[scan_pos++] = run;
                run = -1; // reset zero level counter
            }
            level = isignab(level, m7[i]);
            ilev = level;
        }
    }

Mostly minor changes to loops
Bug fixes
Revisions
Possible to design custom HW with minor programmability extensions

8
Programmable Loop Accelerator

Generalize accelerator without losing efficiency

FPGAs
General-Purpose Processors
DSPs
Domain-specific accelerators
Flexibility
???
Programmable Loop Accelerators
Loop Accelerators, ASICs
Efficiency, Performance
9
Designing Loop Accelerators
10
Loop Accelerator Architecture
CRF
Point-to-point Connections

FSM
Local Mem

MEM
BR
Control signals
Hardware realization of modulo scheduled loop
Parameterized execution resources, storage, connectivity
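
In a modulo-scheduled loop, a new iteration starts every II cycles (the initiation interval), and the number of function units of each kind limits how small II can be. As a rough sketch of that relationship, the C fragment below computes the standard resource-constrained lower bound on II. The FU classes, the function name res_mii, and any operation counts passed to it are hypothetical and are not taken from the slides.

```c
#include <assert.h>

/* Hypothetical FU classes for illustration; the slides do not enumerate them. */
enum { FU_ADD, FU_MUL, FU_MEM, FU_BR, FU_CLASSES };

/*
 * Resource-constrained lower bound on the initiation interval:
 * the maximum over FU classes of ceil(ops / units).
 * Returns -1 if the loop needs an FU class that has no units.
 */
int res_mii(const int ops[FU_CLASSES], const int units[FU_CLASSES])
{
    int mii = 1;
    for (int c = 0; c < FU_CLASSES; c++) {
        assert(ops[c] >= 0);
        if (ops[c] == 0)
            continue;               /* class not used by this loop */
        if (units[c] <= 0)
            return -1;              /* loop cannot be mapped at all */
        int need = (ops[c] + units[c] - 1) / units[c];  /* ceiling division */
        if (need > mii)
            mii = need;
    }
    return mii;
}
```

For example, a loop with 3 adds, 2 multiplies, 4 memory operations and 1 branch on a datapath with 1 adder, 1 multiplier, 2 memory ports and 1 branch unit is bounded by the adder at II = 3. Recurrences in the loop can raise the bound further; this sketch ignores them.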
11
LA Scheduling
FIR Loop Kernel
Mult result has longer lifetime
Paths missing
No subtract
????
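
For reference, an FIR loop kernel of the kind named above can be written as a multiply-accumulate loop. This is a generic textbook form with illustrative names, not the exact kernel on the slide. It uses only multiplies and adds, which is consistent with a datapath synthesized for it having no subtractor.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Direct-form FIR filter (illustrative, not the slide's exact kernel).
 * x must hold n_out + taps - 1 samples; y receives n_out outputs:
 *   y[n] = sum over k in [0, taps) of h[k] * x[n + taps - 1 - k]
 */
void fir(const int *x, const int *h, int *y, size_t n_out, size_t taps)
{
    assert(x && h && y && taps > 0);
    for (size_t n = 0; n < n_out; n++) {
        int acc = 0;
        for (size_t k = 0; k < taps; k++)
            acc += h[k] * x[n + taps - 1 - k];  /* multiply feeds the accumulate */
        y[n] = acc;
    }
}
```

A loop that needs an operation or a producer-consumer path absent from this kernel, such as a subtract, has nowhere to go on hardware built only for it.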
12
LA Datapath Restrictions
[Figure: slow-down plotted against graph difference]
13
Programmable Loop-Accelerator Architecture
[Figure: PLA datapath; labels: CRF, point-to-point connections, control memory, FSM, local memory, function units including MEM and BR, control signals, RR (x4), SRF (x4)]
                LA                        PLA
Functionality   Custom FU set             Generalized FUs + MOVs
Storage         Limited size, no addr.    Rotating Reg. Files
Connectivity    Point-to-point            Bus + Port-swapping
Control         Hardwired Control         Lit. Reg. File + Control Mem
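
One way to picture the storage row: a rotating register file keeps addressable registers but adds a rotating base to every register specifier, so a value written under one name in this iteration is read under the next name in the following iteration. The model below is a minimal software sketch of that idea; the size, type, and function names are hypothetical and do not describe the PLA's actual register files.

```c
#include <assert.h>

#define RRF_SIZE 8  /* hypothetical size, chosen for illustration */

typedef struct {
    int regs[RRF_SIZE];
    unsigned base;  /* rotating base, changed once per loop iteration */
} rrf_t;

/* Map a logical register number to a physical slot. */
static unsigned rrf_phys(const rrf_t *f, unsigned r)
{
    return (f->base + r) % RRF_SIZE;
}

void rrf_write(rrf_t *f, unsigned r, int v)
{
    assert(r < RRF_SIZE);
    f->regs[rrf_phys(f, r)] = v;
}

int rrf_read(const rrf_t *f, unsigned r)
{
    assert(r < RRF_SIZE);
    return f->regs[rrf_phys(f, r)];
}

/* Rotate by one: the value visible as logical r becomes visible as r + 1. */
void rrf_rotate(rrf_t *f)
{
    f->base = (f->base + RRF_SIZE - 1) % RRF_SIZE;
}
```

Writing register 0 in every iteration and rotating at the iteration boundary leaves the last few values readable as registers 1, 2, and so on, which is what lets overlapped iterations of a modulo-scheduled loop keep their values apart.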
14
Experimental Setup

Wide variety of benchmarks
DSP
Media
Linear Algebra
Baseline LAs
Used LA synthesis system to generate HDL
200 MHz @ 0.13 µm
Comparisons
PLAs (200 MHz @ 0.13 µm)
OR-1200 (300 MHz @ 0.13 µm)

15
Area
OR-1200
16
Power Consumption
OR-1200
17
Power Consumption
OR-1200
OR-1200 equiv
18
Power Breakdown
19
Scheduling for PLAs
Synthesis System
Hardware
Compiler SMT-solver
Loop 2

Generalize accelerator architecture
Map new loops to existing hardware

20
PLA Scheduling
21
PLA Scheduling
SMT
22
Programmability
Small, with complex communication
Small, with simple communication
23
Power Efficiency
[Figure: performance (MIPS) vs. power (mW)]
LA: 105 MIPS/mW
PLA: 24 MIPS/mW
Tensilica Diamond Core: 12 MIPS/mW
TI C6x: 5 MIPS/mW
ARM11: 3 MIPS/mW
OR1K: 2 MIPS/mW
Itanium2: 0.08 MIPS/mW
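
To put the figures above in relative terms, the short C sketch below tabulates them and computes efficiency ratios. The MIPS/mW values are the ones quoted on this slide; the type and function names are illustrative only.

```c
#include <assert.h>
#include <stdio.h>

typedef struct {
    const char *name;
    double mips_per_mw;  /* value as quoted on the slide */
} design_t;

static const design_t designs[] = {
    {"LA", 105.0},
    {"PLA", 24.0},
    {"Tensilica Diamond Core", 12.0},
    {"TI C6x", 5.0},
    {"ARM11", 3.0},
    {"OR1K", 2.0},
    {"Itanium2", 0.08},
};
enum { N_DESIGNS = sizeof designs / sizeof designs[0] };

/* How many times more power-efficient design a is than design b. */
double efficiency_ratio(double a_mips_per_mw, double b_mips_per_mw)
{
    assert(b_mips_per_mw > 0.0);
    return a_mips_per_mw / b_mips_per_mw;
}

/* Print every design's efficiency relative to the PLA (index 1 above). */
void print_relative_to_pla(FILE *out)
{
    const double pla = designs[1].mips_per_mw;
    for (int i = 0; i < N_DESIGNS; i++)
        fprintf(out, "%-24s %7.2f MIPS/mW  %7.3fx PLA\n",
                designs[i].name, designs[i].mips_per_mw,
                efficiency_ratio(designs[i].mips_per_mw, pla));
}
```

By these numbers the hardwired LA is about 4.4 times as efficient as the PLA, while the PLA is 2 times as efficient as the Tensilica core and 12 times as efficient as the OR1K.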
24
Conclusion