High Speed FIR Filter Implementation Using Add and Shift Method
  • Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
  • University of California, Santa Barbara
  • ICCD 2006
  • San Jose, California
  • October 2006

UC Santa Barbara
ICCD 2006
Outline
  • Introduction
  • FIR filter implementation
  • Traditional methods
      • MAC (Multiply Accumulate) implementation
      • DA (Distributed Arithmetic) implementation
  • New method
      • Add and Shift method and CSE (Common Subexpression Elimination)
  • Experiments and results
      • Resource utilization
      • Power consumption
  • Conclusion

Introduction
  • Extensive use of FPGAs in computationally intensive applications such as DSP
  • More logic resources available in current FPGAs
  • Broad applications of FIR filters in multimedia and communications
  • Need for efficient design methods to save area/power
  • Research motivation
      • Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
      • Develop a unified tool for performing redundancy elimination, scheduling and module assignment.
      • Perform physically aware optimizations.
      • Architecture design exploration for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).

FIR Filter: MAC Implementation
  • L-tap FIR filter
  • Convolution of the latest L input samples. L is the number of coefficients h[k] of the filter, and x[n] represents the input time series.

$$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$

  • Disadvantages
      • Large area on FPGA due to multipliers, even though the full flexibility of general-purpose multipliers is not required
      • Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs
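
The MAC form of the convolution can be sketched in a few lines of Python. This is only an illustrative software model, not the authors' hardware: the function name `fir_mac` is hypothetical, and samples before the start of the input are assumed to be zero.

```python
def fir_mac(h, x):
    """Direct-form FIR: y[n] = sum_{k=0}^{L-1} h[k] * x[n-k].

    Assumption for this sketch: samples before the start of x are zero.
    """
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


# Feeding an impulse returns the coefficients themselves.
print(fir_mac([1, 2, 3], [1, 0, 0, 0]))  # [1, 2, 3, 0]
```

Every output sample costs L general-purpose multiplications here, which is the cost the slide lists as the main disadvantage on an FPGA.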

FIR Filter: DA (Distributed Arithmetic) Implementation
  • An alternative to the MAC implementation, and the most common FPGA FIR implementation due to the LUT-rich architecture of FPGAs.
  • The output is the inner product

$$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$

  • Variable x[n] can be represented by

$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$

  • where x_b[n] is the bth bit of x[n] and B is the input width. The inner product can be rewritten as follows.

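FIR Filter: DA (Distributed Arithmetic) Implementation (contd.)

$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\, 2^{B-1} + x_{B-2}[0]\, 2^{B-2} + \dots + x_0[0]\, 2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\, 2^{B-1} + x_{B-2}[1]\, 2^{B-2} + \dots + x_0[1]\, 2^0\bigr) \\
  &\quad + \dots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\, 2^{B-1} + x_{B-2}[N-1]\, 2^{B-2} + \dots + x_0[N-1]\, 2^0\bigr) \\
  &= \bigl(c[0]\, x_{B-1}[0] + c[1]\, x_{B-1}[1] + \dots + c[N-1]\, x_{B-1}[N-1]\bigr)\, 2^{B-1} \\
  &\quad + \bigl(c[0]\, x_{B-2}[0] + c[1]\, x_{B-2}[1] + \dots + c[N-1]\, x_{B-2}[N-1]\bigr)\, 2^{B-2} \\
  &\quad + \dots \\
  &\quad + \bigl(c[0]\, x_0[0] + c[1]\, x_0[1] + \dots + c[N-1]\, x_0[N-1]\bigr)\, 2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$

  • where n = 0, 1, ..., N-1 and b = 0, 1, ..., B-1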
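
The regrouping above can be checked numerically. The sketch below assumes unsigned B-bit inputs, as in the bit expansion on the slide; the function names are hypothetical and chosen only for this illustration.

```python
def inner_product(c, x):
    """Reference result: sum_n c[n] * x[n]."""
    return sum(cn * xn for cn, xn in zip(c, x))


def inner_product_bit_planes(c, x, B):
    """Evaluate sum_b 2^b * sum_n c[n] * x_b[n] for unsigned B-bit inputs."""
    total = 0
    for b in range(B):
        # x_b[n] is bit b of x[n], so the inner sum only adds coefficients.
        plane = sum(cn for cn, xn in zip(c, x) if (xn >> b) & 1)
        total += plane << b  # scale the bit-plane sum by 2^b
    return total


c = [3, -1, 4, 2]
x = [5, 9, 0, 15]  # all values fit in B = 4 bits
print(inner_product(c, x), inner_product_bit_planes(c, x, 4))  # 36 36
```

No multiplication of a coefficient by a data word remains in the second form; each bit plane contributes only a sum of selected coefficients followed by a shift.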

DA (Distributed Arithmetic) Implementation: Serial

Figure: A serial DA filter block diagram
  • n+1 clock cycles are needed for an n-bit input symmetrical filter to generate the output.
  • Performance is limited by the fact that the next input sample can be processed only after every bit of the current input samples has been processed.
  • The tradeoff here is performance for area.

Address   Data
0000      0
0001      C0
0010      C1
...       ...
1111      C0 + C1 + C2 + C3

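
The address/data table above can be modelled as a precomputed lookup table indexed by one bit from each input sample. The following is a rough software sketch of a bit-serial DA evaluation under stated assumptions: unsigned inputs, an explicit left shift in place of the hardware scaling accumulator, and no modelling of the extra cycle for symmetrical filters. The names `build_da_lut` and `serial_da` are hypothetical.

```python
def build_da_lut(c):
    """LUT entry at address a is the sum of c[n] over every set bit n of a."""
    N = len(c)
    return [sum(c[n] for n in range(N) if (a >> n) & 1) for a in range(1 << N)]


def serial_da(c, x, B):
    """Bit-serial DA: one LUT lookup and one shifted accumulate per input bit."""
    lut = build_da_lut(c)
    acc = 0
    for b in range(B):  # corresponds to one clock cycle per bit in hardware
        addr = 0
        for n, xn in enumerate(x):
            addr |= ((xn >> b) & 1) << n  # bit b of every sample forms the address
        acc += lut[addr] << b
    return acc


c = [3, -1, 4, 2]
lut = build_da_lut(c)
print(lut[0b0001], lut[0b0010], lut[0b1111])  # 3 -1 8
print(serial_da(c, [5, 9, 0, 15], 4))         # 36
```

The table has 2^N entries for N coefficients, which is why practical DA filters split long filters into several smaller tables.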
DA (Distributed Arithmetic) Implementation: Parallel
  • The performance of the circuit can be improved by modifying the architecture to a parallel architecture which processes the data bits in groups.
  • Increasing the number of bits sampled has a significant effect on resource utilization on the FPGA.
      • More LUTs
      • Larger scaling accumulator

Figure: A 2-bit parallel DA filter block diagram

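CSE (Common Subexpression Elimination)
  • Linear systems can be modeled using polynomials. Expressions consist of +, -, << operators.
  • Polynomial formulation, where L^i denotes a left shift by i bits and i runs over the nonzero bit positions of the constant C:

$$C \cdot X = \sum_i X L^i$$

$$(14)_{10} \cdot X = (1110)_2 \cdot X = (X \ll 3) + (X \ll 2) + (X \ll 1) = X L^3 + X L^2 + X L^1$$
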
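
The constant-multiplication example for 14 can be reproduced with a short sketch. It assumes a non-negative constant in plain binary (no signed-digit recoding), and the helper names are hypothetical.

```python
def shift_add_terms(c):
    """Return the shift amounts i for which bit i of the constant c is 1."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(c, x):
    """Compute c * x for non-negative c as a sum of left-shifted copies of x."""
    return sum(x << i for i in shift_add_terms(c))


print(shift_add_terms(14))          # [1, 2, 3]
print(multiply_by_constant(14, 7))  # 98
```

Each nonzero bit of the constant costs one shifted term, so a constant with k nonzero bits needs k - 1 additions before any sharing between coefficients is considered.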
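CSE Example

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= 2X_0 + X_1 - X_2 - 2X_3 \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - 2X_1 + 2X_2 - X_3
\end{aligned}
$$

$$
\begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{pmatrix}
\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix}
$$

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$
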
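CSE Example
  • $D_0 = X_0 + X_3$
  • $D_1 = X_1 - X_2$

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$

$$
\begin{aligned}
Y_0 &= D_0 + X_1 + X_2 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= D_0 - X_1 - X_2 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$
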
CSE Example
  • $D_2 = X_1 + X_2$
  • $D_3 = X_0 - X_3$

$$
\begin{aligned}
Y_0 &= D_0 + D_2 \\
Y_1 &= X_0 L + D_1 - X_3 L \\
Y_2 &= D_0 - D_2 \\
Y_3 &= X_0 - D_1 L - X_3
\end{aligned}
$$

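CSE Example

Original expressions: 12 additions, 4 shifts

$$
\begin{aligned}
Y_0 &= X_0 + X_1 + X_2 + X_3 \\
Y_1 &= X_0 L + X_1 - X_2 - X_3 L \\
Y_2 &= X_0 - X_1 - X_2 + X_3 \\
Y_3 &= X_0 - X_1 L + X_2 L - X_3
\end{aligned}
$$

After common subexpression elimination: 8 additions, 2 shifts

$$
\begin{aligned}
D_0 &= X_0 + X_3 & Y_0 &= D_0 + D_2 \\
D_1 &= X_1 - X_2 & Y_1 &= D_1 + D_3 L \\
D_2 &= X_1 + X_2 & Y_2 &= D_0 - D_2 \\
D_3 &= X_0 - X_3 & Y_3 &= D_3 - D_1 L
\end{aligned}
$$
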
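
The equivalence of the two forms in this example can be checked by brute force. The sketch below simply transcribes both sets of expressions, with a left shift by one bit standing for L; the function names are hypothetical.

```python
from itertools import product


def transform_direct(x0, x1, x2, x3):
    """Original form: 12 additions/subtractions and 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def transform_cse(x0, x1, x2, x3):
    """After CSE: 8 additions/subtractions and 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)


# Exhaustive check over a small range of integer inputs.
same = all(transform_direct(*v) == transform_cse(*v)
           for v in product(range(-4, 5), repeat=4))
print(same)  # True
```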
FIR Filter Add/Shift Implementation: Replacing Constant Multiplication by Multiplier Block

FIR Filter Add/Shift Implementation: Registered Adder at no Additional Cost

Extracting Common Subexpressions

$$F_1 = A + B + C + D, \qquad F_2 = A + B + C + E$$

Figure panels: Unoptimized expression trees; Optimization: extracting common expression (A + B + C); extracting common expression (A + B)

Synchronization
  • Extra registers are needed to synchronize the intermediate values, such that new values for A, B, C, D, E, F can be read in every clock cycle.

Figure: Calculating registers required for fastest evaluation

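
One plausible way to count such synchronization registers is sketched below. It is an assumption-laden model, not the procedure from the slides: every adder is taken to be registered (one pipeline stage each), primary inputs arrive at stage 0, and all consumers of a signal are assumed to share a single delay chain. The function names and the signal names `T1`, `T2`, `U1`, `U2` are hypothetical.

```python
def pipeline_levels(adders, inputs):
    """adders maps an output name to its two operands, in dependency order."""
    level = {name: 0 for name in inputs}
    for name, (a, b) in adders.items():
        level[name] = 1 + max(level[a], level[b])  # registered adder = one stage
    return level


def delay_registers(adders, inputs):
    """Extra registers per signal so both operands of each adder arrive together."""
    level = pipeline_levels(adders, inputs)
    delays = {}
    for name, operands in adders.items():
        for op in operands:
            gap = level[name] - 1 - level[op]
            # Assumption: consumers of one signal tap a shared delay chain.
            delays[op] = max(delays.get(op, 0), gap)
    return {sig: d for sig, d in delays.items() if d > 0}


inputs = ["A", "B", "C", "D", "E"]
# Extracting (A + B + C): a three-stage chain of adders.
chain = {"T1": ("A", "B"), "T2": ("T1", "C"), "F1": ("T2", "D"), "F2": ("T2", "E")}
# Extracting only (A + B): two-stage balanced trees.
balanced = {"T1": ("A", "B"), "U1": ("C", "D"), "U2": ("C", "E"),
            "F1": ("T1", "U1"), "F2": ("T1", "U2")}
print(delay_registers(chain, inputs))     # {'C': 1, 'D': 2, 'E': 2}
print(delay_registers(balanced, inputs))  # {}
```

Under these assumptions the larger shared expression saves an adder but costs five delay registers and one more stage of latency, while the balanced form needs none; the actual trade-off depends on the target device.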
Experiment Results: Resource Utilization/Performance

Filter Implementation Using Add and Shift Method

Filter (# taps)   Slices   LUTs    FFs     Performance (Msps)
6                 264      213     509     251
10                474      406     916     222
13                386      334     749     252
20                856      705     1650    250
28                1294     1145    2508    227
41                2154     1719    4161    223
61                3264     2591    6303    192
119               6009     4821    11551   203
151               7579     6098    14611   180

Filter Implementation Using Xilinx Coregen (PDA)

Filter (# taps)   Slices   LUTs    FFs     Performance (Msps)
6                 524      774     1012    245
10                781      1103    1480    222
13                929      1311    1775    199
20                1191     1631    2288    199
28                1774     2544    3381    199
41                2475     3642    4748    222
61                3528     5335    6812    199
119               6484     9754    12539   205
151               8274     12525   15988   199

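
The per-filter reductions implied by the two tables above can be averaged with a short script. The numbers are copied from the tables; averaging the per-filter percentages (rather than the totals) is an assumption about how the summary figures were derived, and the names used are hypothetical.

```python
# (taps, slices, LUTs, FFs) for each filter, copied from the tables above
ADD_SHIFT = [
    (6, 264, 213, 509), (10, 474, 406, 916), (13, 386, 334, 749),
    (20, 856, 705, 1650), (28, 1294, 1145, 2508), (41, 2154, 1719, 4161),
    (61, 3264, 2591, 6303), (119, 6009, 4821, 11551), (151, 7579, 6098, 14611),
]
COREGEN_PDA = [
    (6, 524, 774, 1012), (10, 781, 1103, 1480), (13, 929, 1311, 1775),
    (20, 1191, 1631, 2288), (28, 1774, 2544, 3381), (41, 2475, 3642, 4748),
    (61, 3528, 5335, 6812), (119, 6484, 9754, 12539), (151, 8274, 12525, 15988),
]
SLICES, LUTS, FFS = 1, 2, 3  # column indices


def average_reduction(column):
    """Mean per-filter percentage reduction of add/shift relative to Coregen PDA."""
    pct = [100.0 * (1 - a[column] / p[column])
           for a, p in zip(ADD_SHIFT, COREGEN_PDA)]
    return sum(pct) / len(pct)


for label, col in (("Slices", SLICES), ("LUTs", LUTS), ("FFs", FFS)):
    print(f"{label}: {average_reduction(col):.1f}% average reduction")
```

With this averaging the LUT figure comes out close to 58.7, and the slice and flip-flop figures land in the mid-20s, in line with the summary numbers quoted in the conclusion.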
Experiment Results: Resource Utilization

Experiment Results: Power Consumption

Creating MAC Filters Using Xilinx Coregen
Experiment Results: Comparison with MAC Filters Using Multiplier Blocks

                  Add Shift Method     MAC filter
Filter (# taps)   Slices    Msps       Slices    Msps
6                 264       296        219       262
10                475       296        418       253
13                387       296        462       253
20                851       271        790       251
28                1303      305        886       251
41                2178      296        1660      243
61                3284      247        1947      242
119               6025      294        3581      241
151               7623      294        7631      215

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks - Resource Utilization

Experiment Results: Comparison with MAC Filters Using Multiplier Blocks - Performance

Conclusion/Observations
  • Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low-area, low-power and high-speed implementations of FIR filters.
  • Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques:
      • An average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs.
      • Better performance in most of the cases, even though our algorithm does not optimize for performance.
      • Up to 50% reduction in dynamic power consumption.
  • Higher performance as the filter size increases.
  • The critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.
