1
High Speed FIR Filter Implementation Using Add and Shift Method

Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
University of California, Santa Barbara
ICCD 2006
San Jose, California
October 2006

UC Santa Barbara
ICCD 2006
2
Outline

Introduction
FIR filter implementation
Traditional Methods
MAC (Multiply Accumulate) implementation
DA (Distributed Arithmetic) implementation
New method
Add and Shift method and CSE (Common Subexpression Elimination)
Experiments and results
Resource utilization
Power consumption
Conclusion

3
Introduction

Extensive use of FPGAs in computationally intensive applications such as DSP
More logic resources available in current FPGAs
Broad applications of FIR filters in multimedia and communications
Need for efficient design methods to save area/power
Research motivation
Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
Develop a unified tool for performing redundancy elimination, scheduling and module assignment.
Perform physically aware optimizations.
Architecture design exploration for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).

4
FIR Filter: MAC Implementation

L-tap FIR filter
Convolution of the latest L input samples. L is the number of coefficients h(k) of the filter, and x(n) represents the input time series.
$$y(n) = \sum_{k=0}^{L-1} h(k)\, x(n-k)$$
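
The convolution above can be sketched in a few lines of Python. This is only an illustration of the multiply-accumulate loop, not the hardware structure from the slides; the function name `fir_mac`, the sample coefficients and the convention that x(n) = 0 for n < 0 are assumptions made for the example.

```python
def fir_mac(h, x):
    """Direct-form FIR: y(n) = sum over k of h(k) * x(n - k).

    Assumes x(n) = 0 for n < 0 (illustrative convention).
    """
    num_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0  # accumulator of the MAC unit
        for k in range(num_taps):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


# Illustrative values, not taken from the slides.
h = [1, 2, 3]
x = [1, 0, 0, 4]
print(fir_mac(h, x))
```

Each output sample costs L multiplications, which is the cost the multiplierless methods in the later slides try to avoid.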

Disadvantages
Large area on the FPGA due to the multipliers, even though the full flexibility of general-purpose multipliers is not required
Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs

5
FIR Filter: DA (Distributed Arithmetic) Implementation

An alternative to the MAC implementation. DA is the most common FPGA FIR implementation because of the LUT-rich architecture of FPGAs.
$$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$
The variable x[n] can be represented by
$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$
where x_b[n] is the b-th bit of x[n] and B is the input width. The inner product can be rewritten as follows:

6
FIR Filter: DA (Distributed Arithmetic) Implementation (contd)

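$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\, 2^{B-1} + x_{B-2}[0]\, 2^{B-2} + \dots + x_0[0]\, 2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\, 2^{B-1} + x_{B-2}[1]\, 2^{B-2} + \dots + x_0[1]\, 2^0\bigr) \\
  &\quad + \dots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\, 2^{B-1} + x_{B-2}[N-1]\, 2^{B-2} + \dots + x_0[N-1]\, 2^0\bigr) \\
  &= \bigl(c[0]\, x_{B-1}[0] + c[1]\, x_{B-1}[1] + \dots + c[N-1]\, x_{B-1}[N-1]\bigr)\, 2^{B-1} \\
  &\quad + \bigl(c[0]\, x_{B-2}[0] + c[1]\, x_{B-2}[1] + \dots + c[N-1]\, x_{B-2}[N-1]\bigr)\, 2^{B-2} \\
  &\quad + \dots \\
  &\quad + \bigl(c[0]\, x_0[0] + c[1]\, x_0[1] + \dots + c[N-1]\, x_0[N-1]\bigr)\, 2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$
where n = 0, 1, ..., N-1 and b = 0, 1, ..., B-1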
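
The regrouping by bit position can be checked numerically. The sketch below is an illustration under stated assumptions: inputs are unsigned B-bit integers (the slides do not show sign handling), and the function names and sample values are made up for the example.

```python
def inner_product_direct(c, x):
    """y = sum over n of c[n] * x[n]."""
    return sum(cn * xn for cn, xn in zip(c, x))


def inner_product_bitplanes(c, x, B):
    """y = sum over b of 2^b * (sum over n of c[n] * x_b[n]).

    Assumes every x[n] is an unsigned integer that fits in B bits.
    """
    y = 0
    for b in range(B):
        # Sum of the coefficients whose input has bit b set.
        plane = sum(cn * ((xn >> b) & 1) for cn, xn in zip(c, x))
        y += plane * (1 << b)  # weight the bit plane by 2^b
    return y


# Illustrative values, not taken from the slides.
c = [3, -1, 4, 2]
x = [5, 9, 0, 15]
print(inner_product_direct(c, x), inner_product_bitplanes(c, x, B=4))
```

Both functions return the same value because the second one only reorders the terms of the double sum.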

7
DA (Distributed Arithmetic) Implementation: Serial

A Serial DA Filter Block Diagram

n+1 clock cycles are needed for an n-bit-input symmetrical filter to generate the output.
Performance is limited by the fact that the next input sample can be processed only after every bit of the current input sample has been processed.
The tradeoff here is performance for area.

Address   Data
0000      0
0001      C0
0010      C1
...       ...
1111      C0 + C1 + C2 + C3
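
A lookup table of this shape, together with a bit-serial accumulation loop, can be sketched as follows. This is a software illustration only: it assumes unsigned B-bit inputs, processes one bit position per loop iteration (standing in for one clock cycle), and does not model the symmetric-filter pre-adder or the extra cycle mentioned above. The names `build_da_lut` and `serial_da` are hypothetical.

```python
def build_da_lut(c):
    """LUT[address] = sum of c[n] for every bit n that is set in address."""
    N = len(c)
    return [
        sum(c[n] for n in range(N) if (addr >> n) & 1)
        for addr in range(1 << N)
    ]


def serial_da(c, x, B):
    """Bit-serial distributed arithmetic for unsigned B-bit inputs."""
    lut = build_da_lut(c)
    acc = 0
    for b in range(B):  # one bit position per iteration, LSB first
        addr = 0
        for n, xn in enumerate(x):
            addr |= ((xn >> b) & 1) << n  # bit b of x[n] becomes address bit n
        acc += lut[addr] * (1 << b)  # scaling accumulator
    return acc


# Illustrative values, not taken from the slides.
c = [3, -1, 4, 2]
lut = build_da_lut(c)
print(lut[0b0001], lut[0b0010], lut[0b1111])
print(serial_da(c, [5, 9, 0, 15], B=4))
```

The table has 2^N entries for N coefficients, which is why practical designs split long filters into several smaller tables.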
8
DA (Distributed Arithmetic) Implementation: Parallel

The performance of the circuit can be improved by modifying the architecture to a parallel architecture which processes the data bits in groups.
Increasing the number of bits sampled has a significant effect on resource utilization on the FPGA.
More LUTs
Larger scaling accumulator

A 2 bit parallel DA Filter Block Diagram
9
CSE (Common Subexpression Elimination)

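Linear systems can be modeled using polynomials.
Expressions consist of +, -, << operators.
Polynomial formulation
$$C \cdot X = \sum_i X L^i$$
where L^i denotes a left shift by i bit positions and i ranges over the set bits of C.
$$(14)_{10} \cdot X = (1110)_2 \cdot X = (X \ll 3) + (X \ll 2) + (X \ll 1) = X L^3 + X L^2 + X L^1$$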
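
The decomposition of a constant multiplication into shifts and additions can be sketched as below. The helper names are hypothetical, and the sketch assumes a non-negative integer constant; it uses the plain binary representation shown above rather than any signed-digit recoding.

```python
def shift_add_terms(c):
    """Shift amounts i for which bit i of the non-negative constant c is set."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(c, x):
    """Compute c * x using only shifts and additions."""
    return sum(x << i for i in shift_add_terms(c))


print(shift_add_terms(14))          # shift amounts for (1110)_2
print(multiply_by_constant(14, 5))  # same value as 14 * 5
```

The number of additions is one less than the number of set bits, which is the quantity that common subexpression elimination then tries to reduce across several constants.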
10
CSE Example
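Y0 = X0 + X1 + X2 + X3
Y1 = 2X0 + X1 - X2 - 2X3
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - 2X1 + 2X2 - X3

$$\begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{pmatrix} \begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix}$$

Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3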
11
CSE Example

D0 = (X0 + X3)
D1 = (X1 - X2)

Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3

Y0 = D0 + X1 + X2
Y1 = X0L + X1 - X2 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - X1L + X2L - X3
12
CSE Example

D2 = (X1 + X2)
D3 = (X0 - X3)

Y0 = D0 + D2
Y1 = X0L + D1 - X3L
Y2 = D0 - D2
Y3 = X0 - D1L - X3
13
CSE Example
12 additions, 4 shifts
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3

8 additions, 2 shifts
D0 = X0 + X3
D1 = X1 - X2
D2 = X1 + X2
D3 = X0 - X3
Y0 = D0 + D2
Y1 = D1 + D3L
Y2 = D0 - D2
Y3 = D3 - D1L
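
The equivalence of the two forms can be checked with a short Python sketch. Here L is modeled as a left shift by one bit (`<< 1`); the function names and the random test harness are assumptions made for the example, not part of the slides.

```python
import random


def transform_direct(x0, x1, x2, x3):
    """Unoptimized form: 12 additions/subtractions and 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def transform_cse(x0, x1, x2, x3):
    """After extracting D0..D3: 8 additions/subtractions and 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    y0 = d0 + d2
    y1 = d1 + (d3 << 1)
    y2 = d0 - d2
    y3 = d3 - (d1 << 1)
    return y0, y1, y2, y3


# Compare both forms on random integer inputs (fixed seed for repeatability).
rng = random.Random(0)
for _ in range(1000):
    xs = [rng.randint(-255, 255) for _ in range(4)]
    assert transform_direct(*xs) == transform_cse(*xs)
print("both forms agree")
```

Counting the operators in each function body gives the same operation counts as quoted above.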
14
FIR Filter Add/Shift Implementation: Replacing Constant Multiplication by Multiplier Block

15
FIR Filter Add/Shift Implementation: Registered Adder at no Additional Cost
16
Extracting Common Subexpressions
F1 = A + B + C + D
F2 = A + B + C + E
Optimization
Extracting Common Expression (A + B + C)
Unoptimized Expression Trees
Extracting Common Expression (A + B)
17
Synchronization

Extra registers are needed to synchronize the intermediate values, such that new values for A, B, C, D, E, F can be read in every clock cycle.

Calculating registers required for fastest evaluation
18
Experiment Results: Resource Utilization/Performance
Filter Implementation Using Add and Shift Method
Filter (# taps)   Slices   LUTs    FFs     Performance (Msps)
6                 264      213     509     251
10                474      406     916     222
13                386      334     749     252
20                856      705     1650    250
28                1294     1145    2508    227
41                2154     1719    4161    223
61                3264     2591    6303    192
119               6009     4821    11551   203
151               7579     6098    14611   180

Filter Implementation Using Xilinx Coregen (PDA)
Filter (# taps)   Slices   LUTs    FFs     Performance (Msps)
6                 524      774     1012    245
10                781      1103    1480    222
13                929      1311    1775    199
20                1191     1631    2288    199
28                1774     2544    3381    199
41                2475     3642    4748    222
61                3528     5335    6812    199
119               6484     9754    12539   205
151               8274     12525   15988   199
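
The average savings implied by the two tables can be recomputed with a short script. The numbers below are transcribed from the tables; the assumption that the first table is the add and shift design and the second is the Coregen (PDA) design follows the two captions, and averaging the per-filter percentage reductions is one plausible way to summarize them, not necessarily the authors' exact procedure.

```python
# Resource counts transcribed from the two tables (9 filters, 6 to 151 taps).
add_shift = {
    "slices": [264, 474, 386, 856, 1294, 2154, 3264, 6009, 7579],
    "luts": [213, 406, 334, 705, 1145, 1719, 2591, 4821, 6098],
    "ffs": [509, 916, 749, 1650, 2508, 4161, 6303, 11551, 14611],
}
coregen_pda = {
    "slices": [524, 781, 929, 1191, 1774, 2475, 3528, 6484, 8274],
    "luts": [774, 1103, 1311, 1631, 2544, 3642, 5335, 9754, 12525],
    "ffs": [1012, 1480, 1775, 2288, 3381, 4748, 6812, 12539, 15988],
}


def average_reduction(ours, baseline):
    """Mean of the per-filter percentage reductions relative to the baseline."""
    reductions = [100.0 * (1.0 - a / b) for a, b in zip(ours, baseline)]
    return sum(reductions) / len(reductions)


for resource in ("luts", "slices", "ffs"):
    pct = average_reduction(add_shift[resource], coregen_pda[resource])
    print(f"{resource}: {pct:.1f}% average reduction")
```

The printed averages can be compared with the reduction figures quoted in the conclusion.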
19
Experiment Results: Resource Utilization
20
Experiment Results: Power Consumption
21
Creating MAC Filters Using Xilinx Coregen
22
Experiment Results: Comparison with MAC Filters Using Multiplier Blocks
Filter (# taps)   Add/Shift Method        MAC filter
                  Slices     Msps         Slices     Msps
6                 264        296          219        262
10                475        296          418        253
13                387        296          462        253
20                851        271          790        251
28                1303       305          886        251
41                2178       296          1660       243
61                3284       247          1947       242
119               6025       294          3581       241
151               7623       294          7631       215
23
Experiment Results: Comparison with MAC Filters Using Multiplier Blocks - Resource Utilization
24
Experiment Results: Comparison with MAC Filters Using Multiplier Blocks - Performance
25
Conclusion/Observations

Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low-area, low-power and high-speed implementations of FIR filters.
Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques.
An average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs.
Better performance in most of the cases, even though our algorithm does not optimize for performance.
Observed up to 50% reduction in dynamic power consumption.
Higher performance as the filter size increases. The critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.