Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators

Slides: 20

Provided by: manju7

Learn more at: https://cccp.eecs.umich.edu

Transcript and Presenter's Notes

1
Automatic Synthesis of Customized Local Memories
for Multicluster Application Accelerators

Manjunath Kudlur, Kevin Fan,
Michael Chu, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan

2
Motivation

Custom application accelerators (ASICs/ASIPs) require careful data memory system design
Large volumes of data are accessed at high bandwidth
Distributed local memories (scratchpads)
Achieve high bandwidth through parallel access
Achieve low latency by placing data near computation
Custom memory design is complex
Multiple considerations: bandwidth, size requirements, data distribution
A decentralized datapath throws in another monkey wrench

3
Background: Our System

Synthesis of non-programmable accelerators
System similar to PICO (Program-In Chip-Out)
Input is a hot loop nest expressed in C
Throughput-directed synthesis
Required throughput expressed as II (initiation interval)
Innermost loop is modulo scheduled
Datapath derived directly from the schedule
FU allocation to meet II
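
The link between II and FU allocation can be sketched in a few lines. This is a minimal illustration rather than the synthesis system's actual allocator: it assumes fully pipelined FUs that each start one operation per cycle, so a lower bound on the FUs of a given type is the operation count divided by II, rounded up. The function name and the operation mix are hypothetical.

```python
import math
from collections import Counter


def min_fus_per_type(op_types, ii):
    """Lower bound on FUs of each type so that all ops fit in `ii` cycles.

    Assumption (not from the slides): every FU is fully pipelined and
    can start one operation per cycle.
    """
    if ii < 1:
        raise ValueError("II must be at least 1")
    counts = Counter(op_types)
    return {op_type: math.ceil(n / ii) for op_type, n in counts.items()}


# Hypothetical loop body: 5 adds, 3 multiplies, 4 memory operations.
ops = ["add"] * 5 + ["mul"] * 3 + ["mem"] * 4
print(min_fus_per_type(ops, ii=2))
print(min_fus_per_type(ops, ii=4))
```

A smaller II (higher throughput) raises the bound for every FU type, which is why the target II drives the cost of the datapath.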

4
Background: Multicluster Datapath

FUs divided into clusters
Intercluster communication through global bus
Reduced wire lengths, reduced porting on register file structures
Increased compiler complexity

[Figure: a C program mapped onto a two-cluster datapath. Each cluster (Cluster 1, Cluster 2) contains register FIFOs, FUs, and MEM units attached to local memories; the clusters are joined by an interconnection network.]

5
Background: Local Memories

SRAMs connected to MEM units in clusters
Data structures assigned to a single SRAM
Can be whole arrays or parts of an array
Currently only whole arrays are considered
Multiple arrays can be combined in a single SRAM

[Figure: Cluster 1 with register FIFOs, FUs, and MEM units attached to local memories.]

6
Problem Statement and Approach

Given a set of arrays, their sizes and bitwidths, the corresponding loop nest, the number of clusters, and the target II, find an allocation of arrays to SRAMs and an allocation of SRAMs to clusters such that the overall cost is minimized
Phase-ordered approach that handles two subproblems separately
Memory synthesis
Operation partitioning

7
Combining Arrays

Combining arrays into a single SRAM reduces hardware cost (row decoders, sense amps)
Issues with combining
Consider two arrays with (bitwidth, size) = (B1, S1) and (B2, S2)
Suppose A1 and A2 are the numbers of static accesses to each array in the loop
Number of ports
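
The trade-off in combining two arrays can be made concrete with a small sketch. The width rule MAX(B1, B2) comes from the slide's figure labels; the depth S1 + S2 is read from a partly garbled label, and the port formula (total static accesses divided by II, rounded up) is an assumption that each port serves one access per cycle. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayInfo:
    bitwidth: int  # B: bits per element
    size: int      # S: number of elements
    accesses: int  # A: static accesses in the loop body


def combine(a1, a2):
    """Shape of one SRAM that holds both arrays back to back."""
    return ArrayInfo(
        bitwidth=max(a1.bitwidth, a2.bitwidth),  # width MAX(B1, B2)
        size=a1.size + a2.size,                  # depth S1 + S2 (assumed)
        accesses=a1.accesses + a2.accesses,      # A1 + A2
    )


def ports_needed(arr, ii):
    """Assumed rule: one access per port per cycle, `ii` cycles per iteration."""
    return math.ceil(arr.accesses / ii)


def wasted_bits(a1, a2):
    """Storage lost by padding the narrower array to the wider width."""
    width = max(a1.bitwidth, a2.bitwidth)
    return a1.size * (width - a1.bitwidth) + a2.size * (width - a2.bitwidth)


# Hypothetical arrays X and Y.
x = ArrayInfo(bitwidth=8, size=256, accesses=2)
y = ArrayInfo(bitwidth=16, size=128, accesses=3)
xy = combine(x, y)
print(xy, ports_needed(xy, ii=2), wasted_bits(x, y))
```

Under these assumptions the combined SRAM needs as many ports as the two separate SRAMs together at II = 2, and it also pads every element of the narrower array, so sharing decoders and sense amps is not automatically a win.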

[Figure: arrays X (bitwidth B1, size S1) and Y (bitwidth B2, size S2) combined into one SRAM of width MAX(B1, B2) and size S1 + S2.]

8
Combining Arrays

Multicluster issues
Can cause imbalance in operation distribution
All load/store operations for the combined arrays must be assigned to the same cluster
Can increase intercluster traffic
Address calculations and load-uses would cause extra intercluster moves
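
The effect of operation placement on intercluster traffic can be sketched as a count over dataflow edges. This is an illustrative model only: it assumes one move per (producer, destination cluster) pair, so several consumers in the same remote cluster share a single move. The operation names, edge list, and cluster numbers are hypothetical.

```python
def count_ic_moves(edges, cluster_of):
    """Count intercluster moves implied by a placement.

    edges: iterable of (producer, consumer) operation names.
    cluster_of: dict mapping each operation name to its cluster.
    Assumption: a value is moved once per remote cluster that uses it.
    """
    moves = {
        (src, cluster_of[dst])
        for src, dst in edges
        if cluster_of[src] != cluster_of[dst]
    }
    return len(moves)


# Hypothetical dataflow: an address calculation feeds a load,
# and the loaded value has two uses.
edges = [("addr", "ld"), ("ld", "use1"), ("ld", "use2")]

# The load is pinned to cluster 1 because its array lives there,
# while the surrounding computation sits in cluster 2.
split = {"addr": 2, "ld": 1, "use1": 2, "use2": 2}
together = {"addr": 1, "ld": 1, "use1": 1, "use2": 1}

print(count_ic_moves(edges, split), count_ic_moves(edges, together))
```

Pinning the load away from its address calculation and its uses costs one move in each direction, which is the pattern the slide describes.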

[Figure: a LD and its USE placed in different clusters, connected by an intercluster (IC) move; labels R1 and R2.]

9
Solution 1

Formulate the problem as an integer program
A binary decision variable X(i,j,k,l) denotes the assignment of array i to local memory j with k ports on cluster l
Constraints ensure that intercluster move bandwidth is not violated
Perform operation partitioning and modulo scheduling after memory synthesis
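
To show what the decision variable encodes, the sketch below replaces the integer program with exhaustive search over a tiny instance. It is not the formulation from the talk: the cost function is a made-up stand-in for the SRAM generator data, the port rule (accesses divided by II, rounded up) is an assumption, and a hypothetical per-cluster port cap stands in for the intercluster bandwidth constraints. Choosing a (cluster, slot) pair for each array corresponds to setting exactly one X(i,j,k,l) to 1 per array i, with k derived from the group's accesses.

```python
import itertools
import math


def sram_cost(width, depth, ports):
    # Hypothetical cost model: fixed per-SRAM overhead (decoders, sense
    # amps) plus bit area, multiplied by the number of ports.
    return (500 + width * depth) * ports


def best_allocation(arrays, num_clusters, ii, max_ports_per_cluster):
    """Exhaustive stand-in for the integer program (tiny inputs only).

    arrays: name -> (bitwidth, size, static accesses).
    Returns (cost, groups), where groups maps (cluster, slot) to the
    names of the arrays sharing that SRAM, or None if infeasible.
    """
    names = sorted(arrays)
    slots = [(c, s) for c in range(num_clusters) for s in range(len(names))]
    best = None
    for choice in itertools.product(slots, repeat=len(names)):
        groups = {}
        for name, slot in zip(names, choice):
            groups.setdefault(slot, []).append(name)
        cost = 0
        ports_in_cluster = [0] * num_clusters
        for (cluster, _slot), members in groups.items():
            width = max(arrays[m][0] for m in members)
            depth = sum(arrays[m][1] for m in members)
            ports = math.ceil(sum(arrays[m][2] for m in members) / ii)
            cost += sram_cost(width, depth, ports)
            ports_in_cluster[cluster] += ports
        # Hypothetical constraint standing in for the bandwidth limits.
        if max(ports_in_cluster) > max_ports_per_cluster:
            continue
        if best is None or cost < best[0]:
            best = (cost, groups)
    return best


# Hypothetical input arrays: name -> (bitwidth, size, static accesses).
arrays = {"A": (8, 64, 1), "B": (8, 64, 1), "C": (16, 128, 2), "D": (16, 128, 2)}
cost, groups = best_allocation(arrays, num_clusters=2, ii=2, max_ports_per_cluster=2)
print(cost, sorted(groups.values()))
```

In this toy instance the two small, lightly accessed arrays end up sharing an SRAM, while the two heavily accessed arrays stay separate because combining them would force a second port. A real solver such as lp_solve would express the same choice with linear constraints instead of enumeration.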

[Figure: flow for Solution 1. Inputs: arrays A, B, C, D and the target II. Stages: Memory Synthesis, Operation Partitioning, Modulo Schedule.]

10
Experiments

System implemented in the Trimaran framework
Memory costs obtained from ARTISAN SRAM generator scripts
lp_solve used to solve the integer programs
A set of DSP kernels evaluated
Loop oriented
Many arrays accessed in the loops

11
Results for Solution 1

[Figure: four plots, one per benchmark (huffman, channel, LU, lyapunov); x-axis: target initiation interval (II).]

12
Achieved II in Solution 1

Solution 1 eagerly combines arrays
Potential increase in intercluster moves due to imbalance in the distribution of LD/ST ops
Achieved II is poor due to IC moves in recurrence cycles

[Figure: best II achieved.]

13
Solution 2

Phase-ordered approach
Two highly intertwined decisions: allocation of local memories and partitioning of operations
Three phases
Pre-Partitioning
Memory Synthesis
Operation Partitioning

14
Pre-Partitioning

Performance-oriented operation partitioning
Memory operations accessing the same arrays are bound to the same cluster
Consequently, arrays are bound to clusters
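
The step from a partition of memory operations to a binding of arrays can be sketched as follows. This is a minimal illustration, assuming the partitioner has already placed every memory operation; the function names, operation names, and cluster numbers are hypothetical and do not reproduce the example in the slide's figure.

```python
def bind_arrays_to_clusters(mem_ops, cluster_of):
    """Derive array -> cluster from a partition of memory operations.

    mem_ops: memory operation name -> array it accesses.
    cluster_of: operation name -> cluster chosen by pre-partitioning.
    Raises ValueError if ops on one array landed in different clusters,
    which pre-partitioning is meant to rule out.
    """
    binding = {}
    for op, array in mem_ops.items():
        cluster = cluster_of[op]
        if binding.setdefault(array, cluster) != cluster:
            raise ValueError(f"ops accessing array {array!r} span clusters")
    return binding


def arrays_per_cluster(binding):
    """Group arrays by cluster; later combining stays inside each group."""
    groups = {}
    for array, cluster in sorted(binding.items()):
        groups.setdefault(cluster, []).append(array)
    return groups


# Hypothetical partition of four memory operations over two clusters.
mem_ops = {"ld_a": "A", "st_a": "A", "ld_b": "B", "ld_c": "C"}
cluster_of = {"ld_a": 1, "st_a": 1, "ld_b": 2, "ld_c": 1}

binding = bind_arrays_to_clusters(mem_ops, cluster_of)
print(binding, arrays_per_cluster(binding))
```

Because each array now belongs to exactly one cluster, the later memory synthesis step only has to consider combining arrays that share a cluster.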

[Figure: pre-partitioning binds arrays A, B, C, D, E to Cluster 1 and Cluster 2.]

15
Memory Synthesis

ILP used to optimally combine arrays within each cluster
Pre-partitioning effectively prevents combining arrays when doing so would cause operation imbalance

[Figure: memory synthesis combines arrays A, B, C, D, E into SRAMs within Cluster 1 and Cluster 2.]

16
Results for Solution 2

[Figure: four plots, one per benchmark (channel, huffman, LU, lyapunov); x-axis: target initiation interval (II).]

17
Achieved II for Solution 2