Title: Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators
1Automatic Synthesis of Customized Local Memories
for Multicluster Application Accelerators
- Manjunath Kudlur, Kevin Fan,
- Michael Chu, Scott Mahlke
- Advanced Computer Architecture Laboratory
- University of Michigan
2Motivation
- Custom application accelerators (ASICs/ASIPs) require careful data memory system design
- Large volumes of data access at high bandwidth
- Distributed local memories (scratchpads)
- Achieve high bandwidth through parallel access
- Low latency by placing data near computation
- Custom memory design is complex
- Multiple considerations: bandwidth, size requirements, data distribution
- Decentralized datapath adds another monkey wrench
3Background: Our System
- Synthesis of non-programmable accelerators
- System similar to PICO (Program-In Chip-Out)
- Input is a hot loop nest expressed in C
- Throughput-directed synthesis
- Required throughput expressed as II (initiation interval)
- Innermost loop modulo scheduled
- Datapath derived directly from the schedule
- FU allocation to meet II
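As a rough illustration of FU allocation to meet II: if each FU can start one operation per cycle, then at most II operations per loop iteration can share one FU, which gives a lower bound of ceil(ops / II) FUs per type. The sketch below assumes every operation occupies its FU for exactly one cycle; the function name `min_fus` and the operation counts are hypothetical and not taken from the slides.

```python
from math import ceil

def min_fus(op_counts, ii):
    """Lower bound on FUs per type so that all ops fit within II cycles.

    Assumes each operation occupies its FU for exactly one cycle
    (an assumption of this sketch, not stated in the slides).
    """
    if ii < 1:
        raise ValueError("II must be a positive integer")
    return {fu_type: ceil(count / ii) for fu_type, count in op_counts.items()}

# Hypothetical loop body: 6 adds, 3 multiplies, 4 memory operations.
ops = {"ADD": 6, "MUL": 3, "MEM": 4}
print(min_fus(ops, 2))  # {'ADD': 3, 'MUL': 2, 'MEM': 2}
print(min_fus(ops, 4))  # {'ADD': 2, 'MUL': 1, 'MEM': 1}
```

A larger target II lets more operations share each FU, so the datapath shrinks as the throughput requirement relaxes.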
4Background: Multicluster Datapath
- FUs divided into clusters
- Intercluster communication through global bus
- Reduced wire lengths, reduced porting on register file structures
- Increased compiler complexity
[Figure: a C program mapped onto a two-cluster datapath. Cluster 1 and Cluster 2 each contain FUs, MEM units, register FIFOs, and local memories, and are joined by an interconnection network.]
5Background: Local Memories
- SRAMs connected to MEM units in clusters
- Data structures assigned to a single SRAM
- Can be whole arrays, part of an array
- Currently whole arrays considered
- Multiple arrays can be combined in a single SRAM
[Figure: Cluster 1 with register FIFOs, FUs, MEM units, and local memories.]
6Problem Statement and Approach
- Given a set of arrays, their sizes and bitwidths, the corresponding loop nest, the number of clusters, and the target II, find an allocation of arrays to SRAMs and an allocation of SRAMs to clusters such that overall cost is minimized
- Phase-ordered approach which handles two subproblems separately
- Memory synthesis
- Operation partitioning
7Combining Arrays
- Combining arrays into a single SRAM reduces hardware cost (row decoders, sense amps)
- Issues with combining
- Consider two arrays with (Bitwidth, Size) of (B1, S1) and (B2, S2)
- Suppose A1 and A2 are the numbers of static accesses in the loop
- Number of ports
[Figure: two arrays, X and Y, stacked into one SRAM of bitwidth MAX(B1, B2) and size S1 + S2.]
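The effect of combining two arrays can be sketched as below. The combined width and depth follow what the slide's figure suggests (MAX(B1, B2) wide, S1 + S2 deep). The port formula, ceil((A1 + A2) / II), is an assumption of this sketch (one access per port per cycle) and is not given in the slides; the `wasted_bits` field and the function name `combine` are likewise hypothetical, added to show one cost of mixing bitwidths.

```python
from math import ceil

def combine(b1, s1, a1, b2, s2, a2, ii):
    """Dimensions of a single SRAM holding both arrays.

    The port count assumes each port serves one access per cycle,
    so II cycles of one port serve II accesses (sketch assumption).
    """
    width = max(b1, b2)
    return {
        "bitwidth": width,
        "size": s1 + s2,
        "ports": ceil((a1 + a2) / ii),
        # Padding bits stored because the narrower array sits in wider rows.
        "wasted_bits": (width - b1) * s1 + (width - b2) * s2,
    }

# Made-up example: an 8-bit x 256 array read once per iteration and a
# 16-bit x 64 array accessed twice, at a target II of 2.
print(combine(8, 256, 1, 16, 64, 2, ii=2))
# {'bitwidth': 16, 'size': 320, 'ports': 2, 'wasted_bits': 2048}
```

Under these assumptions, combining saves the duplicated peripheral logic but can force extra ports and padding, which is the trade-off the memory synthesis step has to weigh.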
8Combining Arrays
- Multicluster issues
- Can cause imbalance in operation distribution
- All load/store operations for the combined arrays should be assigned to the same cluster
- Can increase intercluster traffic
- Address calculations and load-uses would cause extra intercluster moves
[Figure: a load (LD) and its USE connected through an intercluster (IC) move between R1 and R2.]
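One simple way to see the extra traffic is to count the dataflow edges whose producer and consumer land in different clusters, since each such edge needs an intercluster move. The sketch below is illustrative only: the operation names, the edge list, and the function `count_ic_moves` are hypothetical.

```python
def count_ic_moves(edges, cluster_of):
    """Count dataflow edges whose two endpoints sit in different clusters."""
    return sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])

# Hypothetical dataflow chain: address calculation -> load -> use.
edges = [("addr", "ld"), ("ld", "use")]

# Everything in the cluster that owns the SRAM: no moves needed.
same = {"addr": 0, "ld": 0, "use": 0}
# Load pinned to cluster 0 by its array, neighbours placed in cluster 1.
split = {"addr": 1, "ld": 0, "use": 1}

print(count_ic_moves(edges, same))   # 0
print(count_ic_moves(edges, split))  # 2
```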
9Solution 1
- Formulate the problem as an integer program
- A binary decision variable X(i,j,k,l) to denote assignment of array i to local memory j with k ports on cluster l
- Constraints to make sure intercluster move bandwidth is not violated
- Perform operation partitioning and modulo scheduling after memory synthesis
[Figure: Solution 1 flow. Input arrays (A, B, C, D) and target II feed memory synthesis, followed by operation partitioning and the modulo schedule.]
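To give a feel for the search space the integer program covers, the sketch below replaces the ILP with exhaustive enumeration over every grouping of arrays into SRAMs. It is not the authors' formulation: it ignores clusters and intercluster bandwidth, and the cost model (a fixed overhead per port plus width times depth), the port limit, and all array parameters are made-up assumptions.

```python
from math import ceil

def partitions(items):
    """Yield every way to split items into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing group in turn...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ...or into a new group of its own.
        yield [[first]] + part

def sram_cost(group, arrays, ii, overhead=1000, max_ports=2):
    """Hypothetical cost of one SRAM; None if it needs too many ports."""
    width = max(arrays[a][0] for a in group)
    depth = sum(arrays[a][1] for a in group)
    ports = ceil(sum(arrays[a][2] for a in group) / ii)
    if ports > max_ports:
        return None
    return overhead * ports + width * depth

def best_allocation(arrays, ii):
    """Return (total cost, grouping) with the lowest total cost."""
    best = None
    for part in partitions(sorted(arrays)):
        costs = [sram_cost(group, arrays, ii) for group in part]
        if None in costs:
            continue  # some SRAM in this grouping is infeasible
        total = sum(costs)
        if best is None or total < best[0]:
            best = (total, part)
    return best

# name: (bitwidth, size, static accesses), all values made up
arrays = {"A": (8, 256, 1), "B": (8, 128, 1), "C": (32, 64, 2), "D": (16, 64, 1)}
total, groups = best_allocation(arrays, ii=2)
print(total, groups)
```

With these made-up numbers the search merges the two 8-bit arrays and leaves the other two in their own SRAMs. Enumeration grows very quickly with the number of arrays, which is one reason to hand the problem to an integer programming solver instead.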
10Experiments
- System implemented in the Trimaran framework
- Memory costs obtained from ARTISAN SRAM generator scripts
- lp_solve used to solve the integer programs
- A set of DSP kernels evaluated
- Loop oriented
- Many arrays accessed in the loops
11Results for Solution 1
[Figure: four plots, one per benchmark (huffman, channel, LU, lyapunov); x-axis: Target Initiation Interval (II).]
12Achieved II in Solution 1
- Solution 1 eagerly combines arrays
- Potential increase in intercluster moves due to imbalance in distribution of LD/ST ops
- Achieved II poor due to IC moves in recurrence cycles
[Figure: best II achieved.]
13Solution 2
- Phase-ordered approach
- Two highly intertwined decisions: allocation of local memories and partitioning of operations
- Three phases
- Pre-Partitioning
- Memory Synthesis
- Operation Partitioning
14Pre-Partitioning
- Performance-oriented operation partitioning
- Memory operations accessing the same arrays are bound to the same cluster
- Consequently, arrays are bound to clusters
[Figure: pre-partitioning binds arrays A, B, C, D, and E to Cluster 1 and Cluster 2.]
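The binding rule above can be sketched as follows: once the partitioner has placed every memory operation, each array inherits the cluster of the operations that access it, and any array whose operations span clusters is reported as a violation. The operation names, array names, and the function `bind_arrays` are hypothetical; how the actual partitioner enforces the rule is not described in the slides.

```python
def bind_arrays(mem_ops, cluster_of):
    """Derive array -> cluster from the cluster of each memory op.

    mem_ops maps a memory operation to the array it accesses.
    Raises ValueError if ops on one array ended up in different clusters.
    """
    binding = {}
    for op, array in mem_ops.items():
        cluster = cluster_of[op]
        # setdefault records the first cluster seen for this array.
        if binding.setdefault(array, cluster) != cluster:
            raise ValueError(f"operations on array {array} span clusters")
    return binding

# Hypothetical loop with a load and a store on A and a load on B.
mem_ops = {"ld1": "A", "st1": "A", "ld2": "B"}
cluster_of = {"ld1": 0, "st1": 0, "ld2": 1}
print(bind_arrays(mem_ops, cluster_of))  # {'A': 0, 'B': 1}
```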
15Memory Synthesis
- ILP used to optimally combine arrays within clusters
- Pre-partitioning effectively disables combining of arrays that cause operation imbalance
[Figure: memory synthesis combines arrays A, B, C, D, and E into SRAMs within Cluster 1 and Cluster 2.]
16Results for Solution 2
[Figure: four plots, one per benchmark (channel, huffman, LU, lyapunov); x-axis: Target Initiation Interval (II).]
17Achieved II for Solution 2
- Cost of synthesized memory is not substantially different
- But achieved II is 36% better with pre-partitioning
[Figure: best II achieved.]
18Conclusion
- An approach for synthesizing custom local memories
- ILP-based optimal solution
- Works for clustered datapaths
- Pre-partitioning to improve achieved throughput, with minimal impact on cost
- For more information
- http://cccp.eecs.umich.edu
19Example