Automatic Synthesis of Customized Local Memories for Multicluster Application Accelerators - PowerPoint PPT Presentation


Slides: 20
Provided by: manju7

Transcript and Presenter's Notes

1
Automatic Synthesis of Customized Local Memories
for Multicluster Application Accelerators
  • Manjunath Kudlur, Kevin Fan,
  • Michael Chu, Scott Mahlke
  • Advanced Computer Architecture Laboratory
  • University of Michigan

2
Motivation
  • Custom application accelerators (ASICs/ASIPs)
    require careful data memory system design
  • Large volumes of data access at high bandwidth
  • Distributed local memories (scratchpads)
  • Achieve high bandwidth through parallel access
  • Low latency by placing data near computation
  • Custom memory design is complex
  • Multiple considerations: bandwidth, size
    requirements, data distribution
  • Decentralized datapath is another monkey wrench

3
Background: Our System
  • Synthesis of non-programmable accelerators
  • System similar to PICO (Program-In Chip-Out)
  • Input is a hot loop nest expressed in C
  • Throughput-directed synthesis
  • Required throughput expressed as II (initiation
    interval)
  • Innermost loop modulo scheduled
  • Datapath derived directly from the schedule
  • FU allocation to meet II
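
The link between the target II and FU allocation can be sketched with a simple lower bound. Assuming every operation occupies a fully pipelined FU for one cycle (an assumption; the slides do not state the resource model), an FU type that executes N operations per iteration needs at least ceil(N / II) units. The function name and the operation mix below are hypothetical.

```python
import math
from collections import Counter

def min_fus_for_ii(op_types, ii):
    """Lower bound on FUs per type so that all ops fit in `ii` cycles.

    Assumes each op occupies one FU of its type for exactly one cycle.
    """
    counts = Counter(op_types)
    return {op_type: math.ceil(n / ii) for op_type, n in counts.items()}

# Hypothetical loop body: one entry per static operation.
loop_body = ["add"] * 3 + ["mul"] * 2 + ["load"] * 4 + ["store"]
alloc = min_fus_for_ii(loop_body, ii=2)
print(alloc)
```

A smaller II (higher throughput) raises the FU count, which is why the datapath cost follows from the throughput target.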

4
Background: Multicluster Datapath
  • FUs divided into clusters
  • Intercluster communication through global bus
  • Reduced wire lengths, reduced porting on register
    file structures
  • Increased compiler complexity

[Figure: multicluster datapath. Labels: C Program; Interconnection Network; Cluster 1 and Cluster 2, each with Register FIFOs, FU and MEM units, and Local Memories]
5
Background: Local Memories
  • SRAMs connected to MEM units in clusters
  • Each data structure is assigned to a single SRAM
  • Can be whole arrays or parts of arrays
  • Currently only whole arrays are considered
  • Multiple arrays can be combined in a single SRAM

[Figure: Cluster 1 with Register FIFOs, FU and MEM units, and Local Memories]
6
Problem Statement and Approach
  • Given a set of arrays, their sizes and
    bitwidths, the corresponding loop nest, the
    number of clusters and the target II, find an
    allocation of arrays to SRAMs and allocation of
    SRAMs to clusters such that overall cost is
    minimized
  • Phase-ordered approach that handles two
    subproblems separately
  • Memory synthesis
  • Operation partitioning

7
Combining Arrays
  • Combining arrays into a single SRAM reduces
    hardware cost (row decoders, sense amps)
  • Issues with combining
  • Consider two arrays with (Bitwidth, Size) = (B1,
    S1) and (B2, S2)
  • Suppose A1 and A2 are the numbers of static
    accesses in the loop
  • Number of ports
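
The geometry of a combined SRAM can be sketched as follows. The width MAX(B1, B2) and the stacked depth follow the slide's figure; the port count is an assumption, not taken from the slides: if each port serves one access per cycle, the A1 + A2 static accesses must fit in II cycles, giving ceil((A1 + A2) / II) ports. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class ArrayInfo:
    name: str
    bitwidth: int   # B: bits per element
    size: int       # S: number of elements
    accesses: int   # A: static accesses per loop iteration

def combine(a, b, ii):
    """Shape of one SRAM holding both arrays."""
    width = max(a.bitwidth, b.bitwidth)   # MAX(B1, B2)
    depth = a.size + b.size               # arrays stacked: S1 + S2
    # ASSUMPTION: one access per port per cycle.
    ports = math.ceil((a.accesses + b.accesses) / ii)
    used = a.bitwidth * a.size + b.bitwidth * b.size
    return {"width": width, "depth": depth, "ports": ports,
            "wasted_bits": width * depth - used}

x = ArrayInfo("X", bitwidth=16, size=256, accesses=2)
y = ArrayInfo("Y", bitwidth=32, size=128, accesses=1)
shape = combine(x, y, ii=2)
print(shape)
```

The `wasted_bits` field shows one cost of combining arrays of unequal bitwidth: the narrower array leaves part of each word unused.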

[Figure: arrays X (bitwidth B1, size S1) and Y (bitwidth B2, size S2) combined into one SRAM of width MAX(B1, B2) and depth S1 + S2]
8
Combining Arrays
  • Multicluster issues
  • Can cause imbalance in operation distribution
  • All load/store operations for the combined arrays
    should be assigned to the same cluster
  • Can increase intercluster traffic
  • Address calculations and load-uses would cause
    extra intercluster moves
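
The effect on intercluster traffic can be illustrated by counting the dataflow edges that cross clusters. This is a minimal sketch that assumes one move per crossing edge; the operation names and cluster numbers are hypothetical.

```python
def count_ic_moves(edges, cluster_of):
    """Count producer -> consumer edges whose endpoints sit in different clusters."""
    return sum(1 for src, dst in edges if cluster_of[src] != cluster_of[dst])

# Address calculation feeds a load, whose result feeds a use.
edges = [("addr", "ld"), ("ld", "use")]

# The load is pinned to cluster 1 because its array lives there,
# while the address calculation and the use were placed in cluster 2.
pinned = {"addr": 2, "ld": 1, "use": 2}
# Without the pinning, all three ops could share cluster 2.
free = {"addr": 2, "ld": 2, "use": 2}

print(count_ic_moves(edges, pinned), count_ic_moves(edges, free))
```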

[Figure labels: R1, R2, IC Move, LD, USE]
9
Solution 1
  • Formulate the problem as an integer program
  • A binary decision variable X(i,j,k,l) to denote
    assignment of array i to local memory j with
    k ports on cluster l
  • Constraints to make sure intercluster move
    bandwidth is not violated
  • Perform operation partitioning and modulo
    scheduling after memory synthesis
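
The indexing of the decision variables X(i,j,k,l) can be sketched without a solver. The structural checks below (each array assigned exactly once; each local memory has one port count and one cluster) are assumptions about what such a formulation needs, not constraints quoted from the slides, and the bandwidth constraints are omitted. All names are hypothetical.

```python
from itertools import product

def make_variables(n_arrays, n_mems, max_ports, n_clusters):
    """All index tuples (i, j, k, l) of the binary variables X(i,j,k,l)."""
    return list(product(range(n_arrays), range(n_mems),
                        range(1, max_ports + 1), range(n_clusters)))

def is_consistent(chosen, n_arrays):
    """Check assumed structural constraints on the tuples with X = 1."""
    # Each array is assigned exactly once.
    arrays = sorted(i for i, _, _, _ in chosen)
    if arrays != list(range(n_arrays)):
        return False
    # Each local memory has a single port count and a single cluster.
    config = {}
    for _, j, k, l in chosen:
        if config.setdefault(j, (k, l)) != (k, l):
            return False
    return True

variables = make_variables(n_arrays=4, n_mems=4, max_ports=2, n_clusters=2)
print(len(variables))  # 4 * 4 * 2 * 2 = 64

# Arrays 0 and 1 share memory 0 (1 port, cluster 0).
good = {(0, 0, 1, 0), (1, 0, 1, 0), (2, 1, 2, 1), (3, 2, 1, 1)}
# Memory 0 is given two different port counts: inconsistent.
bad = {(0, 0, 1, 0), (1, 0, 2, 0), (2, 1, 2, 1), (3, 2, 1, 1)}
print(is_consistent(good, 4), is_consistent(bad, 4))
```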

[Figure: Solution 1 flow. Input arrays (A, B, C, D) and target II feed Memory Synthesis, followed by Operation Partitioning and Modulo Schedule]
10
Experiments
  • System implemented in the Trimaran framework
  • Memory costs obtained from ARTISAN SRAM generator
    scripts
  • lp_solve used to solve the integer programs
  • A set of DSP kernels evaluated
  • Loop oriented
  • Many arrays accessed in the loops

11
Results for Solution 1
[Figure: four plots, one per benchmark (huffman, channel, LU, lyapunov); x-axis: Target Initiation Interval (II)]
12
Achieved II in Solution 1
  • Solution 1 eagerly combines arrays
  • Potential increase in intercluster moves due to
    imbalance in distribution of LD/ST ops
  • Achieved II is poor due to IC moves in recurrence
    cycles

[Figure: Best II achieved]
13
Solution 2
  • Phase-ordered approach
  • Two highly intertwined decisions: allocation of
    local memories and partitioning of operations
  • Three phases
  • Pre-Partitioning
  • Memory Synthesis
  • Operation Partitioning

14
Pre-Partitioning
  • Performance-oriented operation partitioning
  • Memory operations accessing the same arrays are
    bound to same cluster
  • Consequently, arrays are bound to clusters
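
The binding rule above can be sketched as follows: memory operations are grouped by the array they access, each group moves as a unit, and the array-to-cluster binding falls out of the result. The greedy load balancing used here is only a stand-in for the performance-oriented partitioner, which the slides do not describe; all names are hypothetical.

```python
from collections import defaultdict

def pre_partition(mem_ops, n_clusters):
    """Bind all memory ops of an array to one cluster; return array -> cluster."""
    groups = defaultdict(list)
    for op, array in mem_ops:
        groups[array].append(op)

    load = [0] * n_clusters        # memory ops placed on each cluster so far
    array_cluster = {}
    # Place the largest groups first, each on the least-loaded cluster.
    for array, ops in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        cluster = min(range(n_clusters), key=lambda c: load[c])
        array_cluster[array] = cluster
        load[cluster] += len(ops)
    return array_cluster, load

mem_ops = [("ld_a1", "A"), ("ld_a2", "A"), ("st_a", "A"),
           ("ld_b", "B"), ("st_b", "B"),
           ("ld_c", "C"),
           ("ld_d1", "D"), ("ld_d2", "D")]
binding, load = pre_partition(mem_ops, n_clusters=2)
print(binding, load)
```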

[Figure: pre-partitioning binds arrays A, B, C, D, E to Cluster 1 and Cluster 2]
15
Memory Synthesis
  • ILP used to optimally combine arrays within
    clusters
  • Pre-partitioning effectively disables combining
    of arrays that cause operation imbalance
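
Combining arrays within one cluster can be sketched as a search over all groupings of that cluster's arrays. Exhaustive enumeration stands in for the ILP here, and the cost model (width × depth × ports plus a fixed per-SRAM overhead, with ports = ceil(accesses / II)) is a hypothetical placeholder for real SRAM generator costs.

```python
import math

OVERHEAD = 1000  # hypothetical fixed cost per SRAM (decoders, sense amps)

def sram_cost(group, ii):
    """Cost of one SRAM holding every (name, bitwidth, size, accesses) in group."""
    width = max(bw for _, bw, _, _ in group)
    depth = sum(size for _, _, size, _ in group)
    ports = math.ceil(sum(acc for _, _, _, acc in group) / ii)
    return width * depth * ports + OVERHEAD

def partitions(items):
    """Yield every way to split a list into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part                    # first in its own group
        for idx in range(len(part)):              # or joined to an existing one
            yield part[:idx] + [[first] + part[idx]] + part[idx + 1:]

def best_combination(arrays, ii):
    return min(partitions(arrays),
               key=lambda p: sum(sram_cost(g, ii) for g in p))

# Arrays already bound to one cluster by pre-partitioning.
cluster1 = [("A", 8, 64, 1), ("B", 8, 64, 1), ("C", 32, 512, 2)]
best = best_combination(cluster1, ii=2)
best_cost = sum(sram_cost(g, 2) for g in best)
groups = sorted(sorted(name for name, _, _, _ in g) for g in best)
print(groups, best_cost)
```

Under this toy model the two small, narrow arrays share an SRAM, while the wide array stays separate because merging it would widen every word and add a port.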

[Figure: memory synthesis combines arrays A, B, C, D, E into SRAMs within Cluster 1 and Cluster 2]
16
Results for Solution 2
[Figure: four plots, one per benchmark (channel, huffman, LU, lyapunov); x-axis: Target Initiation Interval (II)]
17
Achieved II for Solution 2
  • Cost of synthesized memory not substantially
    different
  • But achieved II is 36% better with
    pre-partitioning

[Figure: Best II achieved]
18
Conclusion
  • An approach for synthesizing custom local
    memories
  • ILP-based optimal solution
  • Works for clustered datapaths
  • Pre-partitioning to improve achieved throughput,
    with minimal impact on cost
  • For more information
  • http://cccp.eecs.umich.edu

19
Example