Storage Assignment during High-level Synthesis for Configurable Architectures

1
Storage Assignment during High-level Synthesis
for Configurable Architectures
  • Wenrui Gong, Gang Wang, Ryan Kastner
  • Department of Electrical and Computer
    Engineering, University of California, Santa
    Barbara
  • {gong, wanggang, kastner}@ece.ucsb.edu
  • http://express.ece.ucsb.edu
  • November 7, 2005

2
What are we dealing with?
  • FPGA-based reconfigurable architectures with
    distributed block RAM modules
  • Synthesizing high-level programs into hardware
    designs

[Figure: FPGA fabric with configurable logic blocks surrounded by distributed block RAM modules]
3
Options of Storage Assignment
  • Given the same storage and logic resources,
    several different storage assignments are
    possible

[Figure: two alternative assignments of datapaths, control logic, and block RAMs, sharing storage through OR or MUX logic]
4
Objective
  • Different arrangements achieve different
    performance
  • Objective: achieve the best performance
    (throughput) under the resource constraints,
    improve resource utilization, and meet design
    goals (timing, frequency, etc.)

5
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
  • Experimental results
  • Concluding remarks

6
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
  • Experimental results
  • Concluding remarks

7
Target Architecture
  • FPGA-based fine-grained reconfigurable computing
    architecture with distributed block RAM modules

8
Memory Access Latencies
  • Memory access delay = BRAM access delay +
    interconnect delay
  • BRAM access times are fixed by the architecture
  • Interconnect delays are variable: one clock
    cycle to access nearby data, two or more to
    access data far away from the CLB
  • This makes it difficult to estimate execution
    time precisely (latency model below)
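A compact way to write this model (notation mine, not from the slides): let d be the routing distance between the consuming CLB and the BRAM holding the data. Then

    T_{\text{mem}} = T_{\text{BRAM}} + T_{\text{net}}(d), \qquad
    T_{\text{net}}(d) \in \{1, 2, \ldots\} \text{ cycles, increasing with } d

Since placement fixes d only late in the flow, the T_net(d) term is what makes early execution-time estimates imprecise.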

9
Outline
  • Target architectures
  • Data partitioning problem
    • Problem formulation
    • Data partitioning algorithm
  • Memory optimizations
  • Experimental results
  • Concluding remarks

10
Problem Formulation
  • Inputs:
    • an l-level nested loop L
    • a set of n data arrays N
    • an architecture with block RAM modules M
  • Partitioning problem: partition the data arrays
    N into a set of data portions P, and seek an
    assignment from P to the block RAM modules M
  • Objective: minimize latency (formalized below)

[Figure: FPGA fabric with configurable logic blocks surrounded by distributed block RAM modules]
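One way to state this formally (notation assumed; the slides give only the prose version): find a partition P = {P_1, ..., P_k} of the arrays in N and an assignment phi from portions to modules,

    \min_{P,\ \varphi : P \to M} \; T(P, \varphi)
    \quad \text{s.t.} \quad
    \sum_{P_i :\ \varphi(P_i) = M_j} \mathrm{size}(P_i) \;\le\; \mathrm{cap}(M_j)
    \qquad \forall\, M_j \in M

where T(P, phi) is the loop latency under the memory access latencies above and cap(M_j) is the capacity of block RAM module M_j.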
11
Overview of Data Partitioning Algorithm
  • Code analysis
    • determine possible partitioning directions
  • Architectural-level synthesis
    • resource allocation, scheduling, and binding
    • discover the design properties
  • Granularity adjustment
    • use an empirical cost function to estimate
      performance (top-level flow sketched below)
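A minimal sketch of this three-step flow in C, assuming hypothetical helpers enumerate_candidates (code analysis) and synthesize_and_estimate (architectural-level synthesis plus the cost function); none of these names come from the paper:

    #include <math.h>

    typedef struct { int direction; int granularity; } candidate_t;

    /* Code analysis: fill out[] with feasible (direction, granularity)
       pairs; returns how many were found. Assumed helper. */
    extern int enumerate_candidates(candidate_t *out, int max);

    /* Architectural-level synthesis plus the empirical cost function for
       one candidate. Assumed helper; lower is better. */
    extern double synthesize_and_estimate(candidate_t c);

    candidate_t pick_best_partitioning(void) {
        candidate_t cands[64], best = {0, 0};
        double best_cost = INFINITY;
        int n = enumerate_candidates(cands, 64);
        for (int i = 0; i < n; i++) {          /* granularity adjustment */
            double c = synthesize_and_estimate(cands[i]);
            if (c < best_cost) { best_cost = c; best = cands[i]; }
        }
        return best;
    }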

12
Code Analysis
  • Iteration space and data spaces
  • Index functions determine access footprints
    (example below)

[Figure: an iteration space and the corresponding access footprint in data space S]
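A minimal example (mine, not from the paper) of how affine index functions determine footprints:

    #define N 64
    int S[N][N], B[N][N];    /* hypothetical data arrays */

    /* The index functions f(i,j) = (i, j) and g(i,j) = (i, j+1) map each
       iteration (i, j) to its access footprint in data space S: row i,
       columns j and j+1. */
    void body(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N - 1; j++)
                B[i][j] = S[i][j] + S[i][j + 1];
    }

Partitioning the i-loop by rows here is communication-free: each row block of S is touched by exactly one block of iterations.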
13
Iteration/Data Space Partitioning
  • Partitioning the iteration space derives a
    corresponding partitioning of the data spaces
  • Using the index functions
  • Communication-free partitioning: no access
    footprints are shared across partitions

[Figure: a communication-free partitioning of the iteration space and data space S]
14
Iteration/Data Space Partitioning
  • Communication-efficient partitioning
  • Data access footprints overlap across partitions
  • Overlapped footprints cause remote memory
    accesses when the shared data are not placed
    together (example below)

[Figure: overlapping access footprints in the iteration space and data space S]
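A small illustration (mine, not from the paper) of overlapping footprints:

    #define N 64
    int S[N][N], B[N][N];    /* hypothetical data arrays */

    /* Row-block partitioning: rows 0..N/2-1 of S in one BRAM, rows
       N/2..N-1 in another. The read of S[i+1][j] makes the footprints of
       adjacent iteration blocks overlap at the boundary row i == N/2 - 1,
       so one of its two reads becomes remote unless the shared rows are
       placed together. */
    void stencil(void) {
        for (int i = 0; i < N - 1; i++)
            for (int j = 0; j < N; j++)
                B[i][j] = S[i][j] + S[i + 1][j];
    }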
15
Architectural-level Synthesis
  • Synthesize the innermost iteration body
  • Pipeline the designs
  • Collect performance results (record sketched
    below):
    • execution time T,
    • initiation interval II,
    • and resource utilizations u_mul, u_BRAM, and
      u_CLB
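The collected quantities, as a record type (a sketch; the field names follow the slide, the struct itself is mine):

    /* Per-candidate results from architectural-level synthesis. */
    struct synth_result {
        double T;        /* execution time of the pipelined body   */
        int    II;       /* initiation interval, in clock cycles   */
        double u_mul;    /* multiplier utilization, in [0, 1]      */
        double u_BRAM;   /* block RAM utilization, in [0, 1]       */
        double u_CLB;    /* CLB utilization, in [0, 1]             */
    };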

16
Estimating the Execution Time
  • Resource utilizations determine the performance
    of the pipelined designs
  • Execution time is linear in the initiation
    interval and the granularity (estimate below)
  • When more resources are left unoccupied, more
    operations can be performed simultaneously
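The standard pipelined-loop estimate consistent with this claim (D is the pipeline depth and n the number of iterations assigned to a partition; both symbols are mine):

    T \;\approx\; D + II \cdot (n - 1)

Finer granularity lowers n per partition, and lower resource pressure can lower II, which is why both enter the estimate linearly.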

17
Granularity Adjustment
  • For each possible partitioning direction, check
    different granularities to obtain the best
    performance
  • Coarsest: use as few block RAM modules as
    possible

18
Granularity Adjustment
  • For each possible partitioning direction, check
    different granularities to obtain the best
    performance
  • Finest: distribute data across all block RAM
    modules

19
Cost Function
  • An empirical formulation based on our
    architectural-level synthesis results
  • Estimate global memory accesses m_r and total
    memory accesses m_t, and their ratio
  • The factor rewards memory accesses to nearby
    block RAM modules (one possible form sketched
    below)
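The slide does not give the formula itself; the sketch below shows one plausible shape in C, scaling synthesized execution time by a penalty on the remote-access ratio. The weight alpha and the overall form are assumptions, not the paper's actual formula:

    #include <assert.h>

    /* Hypothetical empirical cost: execution time T inflated by the
       fraction of remote accesses m_r out of total accesses m_t.
       Smaller is better; alpha weights the remote-access penalty. */
    double estimate_cost(double T, double m_r, double m_t, double alpha) {
        assert(m_t > 0.0 && m_r <= m_t);
        return T * (1.0 + alpha * (m_r / m_t));
    }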

20
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
    • Scalar replacement
    • Data prefetching
  • Experimental results
  • Concluding remarks

21
Scalar Replacement
  • Scalar replacement increases data reuse and
    reduces memory accesses
  • Memory is accessed in the previous iteration
  • Use the contents already in registers rather
    than accessing memory again (before/after
    example below)
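A classic before/after illustration in C (mine, not from the paper):

    #define N 64
    int a[N], b[N];    /* hypothetical arrays held in block RAM */

    /* Before: every element of a[] is read from memory twice, once as
       a[i] and once more as a[i-1] in the next iteration. */
    void before(void) {
        for (int i = 1; i < N; i++)
            b[i] = a[i] + a[i - 1];
    }

    /* After scalar replacement: the previous iteration's value is carried
       in a register (prev), halving the memory reads on a[]. */
    void after(void) {
        int prev = a[0];
        for (int i = 1; i < N; i++) {
            int cur = a[i];
            b[i] = cur + prev;
            prev = cur;
        }
    }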

22
Data Prefetching and Buffer Insertion
  • Buffer insertion reduces critical paths and
    improves clock frequencies
  • Schedule each global memory access one cycle
    earlier (sketch below)
  • One, two, or more cycles, depending on the size
    of the chip and the location of the BRAM
  • Reduces the length of critical paths
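A prefetching sketch in C (mine, not from the paper; process() is a placeholder for the loop body):

    #define N 64
    int a_remote[N], b[N];     /* a_remote: array in a distant BRAM with
                                  multi-cycle access latency (assumed)   */
    extern int process(int x); /* placeholder for the body computation   */

    /* Issue each remote read one iteration early, so its extra
       interconnect latency overlaps the current iteration's work instead
       of sitting on the critical path. */
    void prefetch_loop(void) {
        int buf = a_remote[0];             /* prefetch for iteration 0 */
        for (int i = 0; i < N; i++) {
            int cur = buf;
            if (i + 1 < N)
                buf = a_remote[i + 1];     /* start next access early  */
            b[i] = process(cur);
        }
    }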

23
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
  • Experimental results
  • Concluding remarks

24
Experimental Setup
  • Target architecture: Xilinx Virtex-II FPGA
  • Target frequency: 150 MHz
  • Benchmarks: image processing and DSP
    applications
    • SOBEL edge detection
    • bilinear filtering
    • 2D Gauss blurring
    • 1D Gauss filter
    • SUSAN principle

25
Collected Results
  • Pre-layout and post-layout timing and area
    results were collected
  • Original: one block RAM module is assigned to
    the entire data array
  • Partitioned: the iteration/data spaces are
    partitioned under resource constraints
  • Optimized: memory optimizations applied to the
    partitioned designs

26
Results: Execution Time
  • The average speedup is 2.75x
  • Under the given resources, data were partitioned
    into 4 portions
  • After further optimizations: 4.80x faster than
    the original designs

27
Results: Achievable Clock Frequencies
  • Partitioned designs run about 10 percent slower
    than the original ones; after optimizations,
    clock frequencies are about 7 percent faster
    than those of the partitioned designs

28
Outline
  • Target architectures
  • Data partitioning problem
  • Memory optimizations
  • Experimental results
  • Concluding remarks

29
Concluding Remarks
  • A data and iteration space partitioning approach
    for homogeneous block RAM modules
    • integrates with existing architectural-level
      synthesis techniques
    • parallelizes input designs
    • dramatically improves system performance

30
Thank You
  • Prof. Ryan Kastner and Gang Wang
  • The reviewers
  • The audience

31
Questions