Title: Storage Assignment during High-level Synthesis for Configurable Architectures
1. Storage Assignment during High-level Synthesis for Configurable Architectures
- Wenrui Gong, Gang Wang, Ryan Kastner
- Department of Electrical and Computer Engineering, University of California, Santa Barbara
- {gong, wanggang, kastner}@ece.ucsb.edu
- http://express.ece.ucsb.edu
- November 7, 2005
2. What are we dealing with?
- FPGA-based reconfigurable architectures
- with distributed block RAM modules
- Synthesizing high-level programs into designs
[Figure: configurable logic blocks surrounded by distributed block RAM modules]
3. Options of Storage Assignment
- Given the same storage/logic resources, different
storage assignments exist
[Figure: two alternative assignments — one block RAM multiplexed among several datapaths with control logic, or separate block RAMs feeding each datapath]
4. Objective
- Different arrangements achieve different performance.
- Objective: achieve the best performance (throughput) under the resource constraints, improve resource utilization, and meet design goals (time, frequency, etc.)
5. Outline
- Target architectures
- Data partitioning problem
- Memory optimizations
- Experimental results
- Concluding remarks
6. Outline
- Target architectures
- Data partitioning problem
- Memory optimizations
- Experimental results
- Concluding remarks
7. Target Architecture
- FPGA-based fine-grained reconfigurable computing
architecture with distributed block RAM modules
8. Memory Access Latencies
- Memory access delay = BRAM access delay + interconnect delay
- BRAM access time is fixed by the architecture
- Interconnect delays are variable:
- one clock cycle to access nearby data, or two or more cycles to access data far away from the CLB
- Difficult to precisely estimate execution time
9. Outline
- Target architectures
- Data partitioning problem
- Problem formulation
- Data partitioning algorithm
- Memory optimizations
- Experimental results
- Concluding remarks
10. Problem Formulation
- Inputs:
- an l-level nested loop L
- a set of n data arrays N
- an architecture with block RAM modules M
- Partitioning problem: partition the data arrays N into a set of data portions P, and seek an assignment from P to the block RAM modules M
- Objective: optimize latency
[Figure: configurable logic blocks surrounded by distributed block RAM modules]
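The assignment step in the formulation above can be sketched as a simple first-fit heuristic. This is an illustrative sketch, not the deck's actual algorithm; the portion names, sizes, and module capacities are hypothetical.

```python
def assign_portions(portions, module_capacities):
    """Hypothetical first-fit assignment of data portions P to block RAM
    modules M: place each portion in the first module with enough
    remaining capacity."""
    remaining = list(module_capacities)
    assignment = {}
    for name, size in portions:
        for idx, cap in enumerate(remaining):
            if size <= cap:
                assignment[name] = idx   # portion -> module index
                remaining[idx] -= size
                break
        else:
            raise ValueError(f"portion {name!r} does not fit in any module")
    return assignment

# Two halves of array A and all of array B, three equal BRAM modules
print(assign_portions([("A0", 3), ("A1", 3), ("B", 4)], [4, 4, 4]))
# -> {'A0': 0, 'A1': 1, 'B': 2}
```

In the real problem the quality of such an assignment is judged by the resulting memory access latency, which is what the partitioning algorithm on the next slides optimizes.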
11. Overview of the Data Partitioning Algorithm
- Code analysis
- Determine possible partitioning directions
- Architectural-level synthesis
- Resource allocation, scheduling, and binding
- Discover the design properties
- Granularity adjustment
- Use an empirical cost function to estimate performance
12. Code Analysis
- Iteration space and data spaces
- Index functions determine access footprints
[Figure: an iteration space mapped onto data space S by the index functions]
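As a toy illustration of access footprints, assume a 4x4 iteration tile and two affine accesses A[i][j] and A[i+1][j] in the loop body (the array name and tile size are hypothetical):

```python
def footprint(index_fn, iterations):
    """Data elements touched by an index function over a set of iterations."""
    return {index_fn(i, j) for (i, j) in iterations}

tile = [(i, j) for i in range(4) for j in range(4)]

# Two affine accesses, e.g. A[i][j] and A[i+1][j]
direct  = footprint(lambda i, j: (i, j),     tile)
shifted = footprint(lambda i, j: (i + 1, j), tile)

# The overlap of the two footprints is exactly the data that would be
# accessed remotely if the overlapping rows landed in different BRAMs.
overlap = direct & shifted
print(len(direct), len(overlap))  # 16 12
```

The size of such overlaps is what distinguishes communication-free from communication-efficient partitionings on the next slides.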
13. Iteration/Data Space Partitioning
- Partitioning the iteration space derives corresponding partitionings of the data spaces
- Using the index functions
- Communication-free partitioning
[Figure: iteration-space partitions inducing disjoint partitions of data space S]
14. Iteration/Data Space Partitioning
- Communication-efficient partitioning
- Data access footprints overlap
- Overlapping footprints cause remote memory accesses when not placed together
[Figure: iteration-space partitions with overlapping footprints in data space S]
15. Architectural-level Synthesis
- Synthesize the innermost iteration body
- Pipeline the designs
- Collect performance results:
- execution time T,
- initiation interval II,
- and resource utilizations u_mul, u_BRAM, and u_CLB
16. Estimating the Execution Time
- Resource utilization determines the performance of the pipelined designs
- Execution time is linear in the initiation interval and the granularity
- When more resources are available, more operations can be performed simultaneously
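The linear relationship above matches the standard pipeline latency model: the first result appears after the pipeline depth, then one result completes every II cycles. A sketch with hypothetical numbers:

```python
def pipelined_cycles(n_iters, ii, depth):
    """Standard pipeline latency model: first result after `depth`
    cycles, then one result every II cycles."""
    return depth + (n_iters - 1) * ii

# Halving II roughly halves execution time for long-running loops
print(pipelined_cycles(1024, 2, 10))  # 2056
print(pipelined_cycles(1024, 1, 10))  # 1033
```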
17. Granularity Adjustment
- For each possible partitioning direction, check different granularities to obtain the best performance
- Coarsest: use as few block RAM modules as possible
18. Granularity Adjustment
- For each possible partitioning direction, check different granularities to obtain the best performance
- Finest: distribute data to all block RAM modules
19. Cost Function
- An empirical formulation based on our architectural-level synthesis results
- Estimate global memory accesses m_r and total memory accesses m_t, and their ratio
- The factor favors memory accesses to nearby block RAM modules
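The slide does not give the exact formula, so the sketch below is only a plausible shape: scale the base execution time by a penalty proportional to the ratio of global (remote) accesses m_r to total accesses m_t, so that partitionings keeping data near its consumers score better. The function name and the penalty weight are hypothetical.

```python
def estimated_cost(base_cycles, m_r, m_t, remote_penalty=1.0):
    """Hypothetical cost: remote accesses inflate execution time in
    proportion to their share of all memory accesses."""
    ratio = m_r / m_t if m_t else 0.0
    return base_cycles * (1.0 + remote_penalty * ratio)

# A partitioning that keeps all accesses local pays no penalty;
# one where half the accesses are remote is penalized accordingly
print(estimated_cost(1000, 0, 200))    # 1000.0
print(estimated_cost(1000, 100, 200))  # 1500.0
```

The granularity-adjustment loop would evaluate this cost for each candidate direction and granularity and keep the minimum.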
20. Outline
- Target architectures
- Data partitioning problem
- Memory optimizations
- Scalar replacement
- Data prefetching
- Experimental results
- Concluding remarks
21. Scalar Replacement
- Scalar replacement increases data reuse and reduces memory accesses
- Data already accessed in a previous iteration
- Use contents already in registers rather than accessing memory again
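A minimal sketch of the transformation, counting memory loads; the loop and array are hypothetical (in the actual flow the input would be a C-like loop nest, and the carried scalar becomes a register):

```python
def sum_adjacent_naive(a):
    """b[i] = a[i-1] + a[i]; a[i-1] is re-read from memory even though
    the previous iteration already loaded that value."""
    loads, out = 0, []
    for i in range(1, len(a)):
        x = a[i - 1]; loads += 1
        y = a[i];     loads += 1
        out.append(x + y)
    return out, loads

def sum_adjacent_scalar_replaced(a):
    """Same computation; the reused value is carried in a scalar
    (a register in hardware) instead of being re-read from memory."""
    loads, out = 0, []
    prev = a[0]; loads += 1
    for i in range(1, len(a)):
        cur = a[i]; loads += 1
        out.append(prev + cur)
        prev = cur          # reuse instead of re-loading a[i-1]
    return out, loads

print(sum_adjacent_naive([1, 2, 3, 4]))            # ([3, 5, 7], 6)
print(sum_adjacent_scalar_replaced([1, 2, 3, 4]))  # ([3, 5, 7], 4)
```

For an n-element array the loads drop from 2(n-1) to n, which directly reduces pressure on the block RAM ports.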
22. Data Prefetching and Buffer Insertion
- Buffer insertion reduces critical paths and optimizes clock frequencies
- Schedule each global memory access one cycle earlier
- One (two, or more) cycles, depending on the size of the chip and the location of the BRAM
- Reduces the length of critical paths
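A back-of-the-envelope model of why issuing loads early helps: the memory latency overlaps with computation instead of serializing with it, so only the first load is exposed. The latency and compute figures below are hypothetical.

```python
def cycles_no_prefetch(n_iters, mem_latency, compute):
    """Each iteration waits for its own load, then computes."""
    return n_iters * (mem_latency + compute)

def cycles_with_prefetch(n_iters, mem_latency, compute):
    """The load for iteration i+1 is issued while iteration i computes;
    steady-state throughput is limited by the slower of the two stages."""
    return mem_latency + n_iters * max(mem_latency, compute)

print(cycles_no_prefetch(100, 2, 3))    # 500
print(cycles_with_prefetch(100, 2, 3))  # 302
```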
23. Outline
- Target architectures
- Data partitioning problem
- Memory optimizations
- Experimental results
- Concluding remarks
24. Experimental Setup
- Target architecture: Xilinx Virtex II FPGA
- Target frequency: 150 MHz
- Benchmarks: image processing and DSP applications
- SOBEL edge detection
- Bilinear filtering
- 2D Gauss blurring
- 1D Gauss filter
- SUSAN principle
25. Collected Results
- Pre-layout and post-layout timing and area results are collected
- Original: assign one block RAM to the entire data array
- Partitioned: the iteration/data spaces are partitioned under resource constraints
- Optimized: memory optimizations applied to the partitioned designs
26. Results: Execution Time
- Average speedup: 2.75 times
- Under the given resources, partitioned into 4 portions
- After further optimizations: 4.80 times faster
27. Results: Achievable Clock Frequencies
- Partitioned designs are about 10 percent slower than the original ones; after optimizations, about 7 percent faster than the partitioned ones
28. Outline
- Target architectures
- Data partitioning problem
- Memory optimizations
- Experimental results
- Concluding remarks
29. Concluding Remarks
- A data and iteration space partitioning approach for homogeneous block RAM modules
- integrates with existing architectural-level synthesis techniques
- parallelizes input designs
- dramatically improves system performance
30. Thank You
- Prof. Ryan Kastner and Gang Wang
- Reviewers
- The audience
31. Questions