Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

About This Presentation

Title:

Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

Description:

Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines ... Automated accelerator synthesis for whole application. Correct by construction ... – PowerPoint PPT presentation

Number of Views:27

Avg rating:3.0/5.0

Slides: 19

Provided by: fank

Learn more at: http://cccp.eecs.umich.edu

Category:

more less

Transcript and Presenter's Notes

Title: Streamroller: Compiler Orchestrated Synthesis of Accelerator Pipelines

1
Streamroller Compiler Orchestrated Synthesis of
Accelerator Pipelines

Manjunath Kudlur, Kevin Fan, Ganesh Dasika, and
Scott Mahlke
University of Michigan

2
Automated C to Gates Solution

SoC design
10-100 Gops, 200 mW power budget
Low level tools ineffective
Automated accelerator synthesis for whole
application
Correct by construction
Increase designer productivity
Faster time to market

3
Streaming Applications

Data streaming through kernels
Kernels are tight loops
FIR, Viterbi, DCT
Coarse grain dataflow between kernels
Sub-blocks of images, network packets

4
System Schema Overview
LA 1
Kernel 1
Task throughput
Kernel 2
Kernel 3
LA 2
time
Kernel 4
Kernel 5
LA 3
5
Input Specification

System specification
Function with main input/output
Local arrays to pass data
Sequence of calls to kernels

Sequential C program
Kernel specification
Perfectly nested FOR loop
Wrapped inside C function
All data access made explicit

row_trans(char inp88, char
out88 )
dct(char inp88, char out88)
for(i0 ilt8 i) for(j0 jlt8 j) .
. . inpij outij . . .
char tmp188, tmp288
row_trans(inp, tmp1) col_trans(tmp1, tmp2)
zigzag_trans(tmp2, out)

col_trans(char inp88, char
out88) zigzag_trans(char inp88,
char out88)
6
System Level Decisions

Throughput of each LA Initiation Interval
Grouping of loops into a multifunction LA
More loops in a single LA ? LA occupied for
longer time in current task

Throughput 1 task / 200 cycles
LA 2
LA 1 occupied for 200 cycles
LA 3
7
System Decisions (Contd..)

Cost of SRAM buffers for intermediate arrays
More buffers ? more task overlap ? high
performance

tmp1 buffer in use by LA2
Adjacent tasks use different buffers
8
Case Study Simple benchmark
LA 1
Loop graph
TC256
3
9
Prescribed Throughput Accelerators

Traditional behavioral synthesis
Directly translate C operatorsinto gates
Our approach Application-centric Architectures
Achieve fixed throughput
Maximize hardware sharing

Operation graph
Datapath
Application
Architecture
10
Loop Accelerator Template

Hardware realization of modulo scheduled loop

Parameterized execution resources, storage,
connectivity

11
Loop Accelerator Design Flow
FU Alloc
FU
FU
.c
RF
C Code, Performance (Throughput)
Abstract Arch
12
Multifunction Accelerator

Map multiple loops to single accelerator
Improve hardware efficiency via reuse
Opportunities for sharing
Disjoint stages(loops 2, 3)
Pipeline slack(loops 4, 5)

Loop 1
Frame Type?
Loop 2
Loop 3
Loop 4
Block 5

Application
13
Union
Cost SensitiveModulo Scheduler
FU
FU
Loop 1
Cost SensitiveModulo Scheduler
Loop 2
FU
FU