Title: High Speed FIR Filter Implementation Using Add and Shift Method
1 High Speed FIR Filter Implementation Using Add and Shift Method
- Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
- University of California, Santa Barbara
- ICCD 2006
- San Jose, California
- October 2006
UC Santa Barbara
ICCD 2006
2 Outline
- Introduction
- FIR filter implementation
- Traditional methods
- MAC (Multiply Accumulate) implementation
- DA (Distributed Arithmetic) implementation
- New method
- Add and Shift method and CSE (Common Subexpression Elimination)
- Experiments and results
- Resource utilization
- Power consumption
- Conclusion
3 Introduction
- Extensive use of FPGAs in computationally intensive applications such as DSP
- More logic resources available in current FPGAs
- Broad applications of FIR filters in multimedia and communications
- Need for efficient design methods to save area/power
- Research motivation
- Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
- Develop a unified tool for performing redundancy elimination, scheduling and module assignment.
- Perform physically aware optimizations.
- Architecture design exploration for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).
4 FIR Filter: MAC Implementation
- L-tap FIR filter
- Convolution of the latest L input samples. L is the number of coefficients h[k] of the filter, and x[n] represents the input time series.
$$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$
- Disadvantages
- Large area on FPGA due to multipliers, even though the full flexibility of general purpose multipliers is not required
- Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs
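The convolution sum on this slide can be sketched in software as one multiply-accumulate per tap. This is only an illustrative model, assuming integer samples and x[n] = 0 for n < 0; the function name `fir_mac` is hypothetical and not taken from the slides.

```python
def fir_mac(h, x):
    """Direct-form FIR: y[n] = sum_{k=0}^{L-1} h[k] * x[n-k].

    Samples before the start of the input are assumed to be zero.
    """
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


# Example: a 3-tap filter applied to a short input sequence
example_output = fir_mac([1, 2, 3], [1, 0, 0, 2])
```

Each output sample costs L multiplications, which is the cost the later slides try to avoid.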
5 FIR Filter: DA (Distributed Arithmetic) Implementation
- An alternative to the MAC implementation. DA is the most common FPGA FIR implementation because of the LUT-rich architecture of FPGAs.
- $y = \sum_{n=0}^{N-1} c[n]\, x[n]$
- Variable x[n] can be represented by
- $x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b$, with $x_b[n] \in \{0, 1\}$
- where $x_b[n]$ is the b-th bit of x[n] and B is the input width. The inner product can be rewritten as follows:
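6 FIR Filter: DA (Distributed Arithmetic) Implementation (contd)
$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\, 2^{B-1} + x_{B-2}[0]\, 2^{B-2} + \dots + x_0[0]\, 2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\, 2^{B-1} + x_{B-2}[1]\, 2^{B-2} + \dots + x_0[1]\, 2^0\bigr) \\
  &\quad + \dots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\, 2^{B-1} + x_{B-2}[N-1]\, 2^{B-2} + \dots + x_0[N-1]\, 2^0\bigr) \\
  &= \bigl(c[0]\, x_{B-1}[0] + c[1]\, x_{B-1}[1] + \dots + c[N-1]\, x_{B-1}[N-1]\bigr)\, 2^{B-1} \\
  &\quad + \bigl(c[0]\, x_{B-2}[0] + c[1]\, x_{B-2}[1] + \dots + c[N-1]\, x_{B-2}[N-1]\bigr)\, 2^{B-2} \\
  &\quad + \dots \\
  &\quad + \bigl(c[0]\, x_0[0] + c[1]\, x_0[1] + \dots + c[N-1]\, x_0[N-1]\bigr)\, 2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$
where n = 0, 1, ..., N-1 and b = 0, 1, ..., B-1.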
7 DA (Distributed Arithmetic) Implementation: Serial
Figure: A serial DA filter block diagram
- n+1 clock cycles are needed for an n-bit input symmetrical filter to generate the output.
- Performance is limited by the fact that the next input sample can be processed only after every bit of the current input samples has been processed.
- The tradeoff here is performance for area.
Address  Data
0000     0
0001     C0
0010     C1
...      ...
1111     C0+C1+C2+C3
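The serial DA scheme can be sketched in software as a lookup table of coefficient sums plus a scaling accumulator. This is a minimal model, assuming unsigned B-bit samples as in the bit-level formula on the earlier slides and ignoring the symmetry optimization; the names `build_da_lut` and `serial_da` are hypothetical.

```python
def build_da_lut(c):
    """LUT entry at address a holds the sum of c[n] for every bit n set in a."""
    N = len(c)
    return [sum(c[n] for n in range(N) if (addr >> n) & 1)
            for addr in range(1 << N)]


def serial_da(c, x, B):
    """Bit-serial DA inner product of coefficients c with unsigned B-bit samples x."""
    lut = build_da_lut(c)
    acc = 0
    for b in range(B):  # one bit plane of all samples per clock cycle
        addr = 0
        for n, sample in enumerate(x):
            addr |= ((sample >> b) & 1) << n  # bit b of x[n] drives address bit n
        acc += lut[addr] << b  # scaling accumulator applies the weight 2^b
    return acc


# Example with 4 coefficients and 4-bit samples
result = serial_da([3, -1, 4, 2], [5, 7, 2, 9], B=4)
```

The loop over `b` mirrors why the serial architecture needs roughly one clock cycle per input bit: the table is small, but it is consulted once for each bit plane.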
8 DA (Distributed Arithmetic) Implementation: Parallel
- The performance of the circuit can be improved by modifying the architecture to a parallel architecture which processes the data bits in groups.
- Increasing the number of bits sampled has a significant effect on resource utilization on FPGA.
- More LUTs
- Larger size scaling accumulator
Figure: A 2-bit parallel DA filter block diagram
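9 CSE (Common Subexpression Elimination)
- Linear systems can be modeled using polynomials. Expressions consist of +, -, << operators.
- Polynomial formulation, where $L^i$ denotes a left shift by i bits and i ranges over the positions of the nonzero digits of C:
$$C \cdot X = \sum_i X L^i$$
$$(14)_{10} \cdot X = (1110)_2 \cdot X = (X \ll 3) + (X \ll 2) + (X \ll 1) = X L^3 + X L^2 + X L^1$$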
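The polynomial formulation can be checked with a short sketch that multiplies by a constant using only shifts and additions. It assumes a non-negative constant in plain binary, as in the (14)_10 example; the helper names are hypothetical.

```python
def shift_add_terms(c):
    """Shift amounts i for which bit i of the non-negative constant c is 1."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]


def multiply_by_constant(c, x):
    """Compute c * x as a sum of shifted copies of x, one per nonzero bit of c."""
    return sum(x << i for i in shift_add_terms(c))


terms_for_14 = shift_add_terms(14)       # shifts used for 14 = (1110)_2
product = multiply_by_constant(14, 5)    # (5 << 3) + (5 << 2) + (5 << 1)
```

Each nonzero bit of the constant costs one shifted term, so a constant with many nonzero bits needs many adders unless terms are shared.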
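10 CSE Example
Y0 = X0 + X1 + X2 + X3
Y1 = 2X0 + X1 - X2 - 2X3
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - 2X1 + 2X2 - X3
Matrix form:
$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
=
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
$$
Polynomial form (L denotes a left shift by one bit):
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3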
11 CSE Example
Before extraction:
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3
After extracting D0 = X0 + X3:
Y0 = D0 + X1 + X2
Y1 = X0L + X1 - X2 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - X1L + X2L - X3
12 CSE Example
After extracting D1 = X1 - X2 and D2 = X1 + X2:
Y0 = D0 + D2
Y1 = X0L + D1 - X3L
Y2 = D0 - D2
Y3 = X0 - D1L - X3
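13 CSE Example
Original expressions (12 additions, 4 shifts):
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3
After common subexpression elimination (8 additions, 2 shifts):
D0 = X0 + X3
D1 = X1 - X2
D2 = X1 + X2
D3 = X0 - X3
Y0 = D0 + D2
Y1 = D1 + D3L
Y2 = D0 - D2
Y3 = D3 - D1L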
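A quick way to convince oneself that the optimized form computes the same outputs is to evaluate both versions side by side. The sketch below assumes integer inputs and reads L as a left shift by one bit; the function names are hypothetical.

```python
from itertools import product


def transform_direct(x0, x1, x2, x3):
    """Original expressions: 12 additions/subtractions and 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3


def transform_cse(x0, x1, x2, x3):
    """After extracting D0..D3: 8 additions/subtractions and 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    return d0 + d2, d1 + (d3 << 1), d0 - d2, d3 - (d1 << 1)


# Exhaustive comparison over a small range of integer inputs
all_match = all(
    transform_direct(*xs) == transform_cse(*xs)
    for xs in product(range(-3, 4), repeat=4)
)
```

The four shared terms D0 to D3 are each computed once and reused by two outputs, which is where the saving of four additions and two shifts comes from.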
14 FIR Filter Add/Shift Implementation: Replacing Constant Multiplication by Multiplier Block
15 FIR Filter Add/Shift Implementation: Registered Adder at no Additional Cost
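The diagrams for these two slides did not survive extraction, so the following is only a rough behavioral sketch of the idea named in the titles. It assumes a transposed-form filter in which every input sample feeds a multiplier block built from shifts and adds, followed by a chain of registered adders; the structure and the names `multiplier_block` and `fir_transposed` are assumptions, and no subexpression sharing is modeled.

```python
def multiplier_block(coeffs, x):
    """Return c * x for every coefficient, using only shifts, adds and negation."""
    products = []
    for c in coeffs:
        mag = abs(c)
        value = sum(x << i for i in range(mag.bit_length()) if (mag >> i) & 1)
        products.append(-value if c < 0 else value)
    return products


def fir_transposed(coeffs, samples):
    """Transposed-form FIR: products feed a chain of adders with registers between them."""
    L = len(coeffs)
    regs = [0] * (L - 1)  # one register after each adder in the chain
    out = []
    for x in samples:
        p = multiplier_block(coeffs, x)
        out.append(p[0] + (regs[0] if regs else 0))
        for k in range(L - 2):
            regs[k] = p[k + 1] + regs[k + 1]  # registered adder output
        if regs:
            regs[-1] = p[L - 1]
    return out


example = fir_transposed([1, 2, 3], [1, 0, 0, 2])
```

In this arrangement every adder output goes straight into a register, which is one plausible reading of why the registers come at no extra cost on an FPGA where each LUT is paired with a flip-flop.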
16 Extracting Common Subexpressions
F1 = A + B + C + D
F2 = A + B + C + E
Figure: Optimization of expression trees. Panels: unoptimized expression trees; extracting common expression (A + B + C); extracting common expression (A + B)
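The tradeoff between the two extraction choices can be explored by counting adders and the longest adder chain for each option. The tree shapes below are assumptions (balanced trees for the unoptimized case), since the figure itself is not available; `analyze` is a hypothetical helper.

```python
def analyze(outputs, share):
    """Return (adder count, longest adder chain) for a list of expression trees.

    A tree is an input name (str) or a pair (left, right) meaning left + right.
    With share=True, structurally identical subtrees are built only once.
    """
    seen = set()
    count = 0

    def depth(node):
        nonlocal count
        if isinstance(node, str):  # primary input, no adder needed
            return 0
        if not (share and node in seen):
            count += 1
            seen.add(node)
        return 1 + max(depth(node[0]), depth(node[1]))

    longest = max(depth(tree) for tree in outputs)
    return count, longest


ab = ("A", "B")
abc = (ab, "C")

unoptimized = analyze([(ab, ("C", "D")), (ab, ("C", "E"))], share=False)
extract_ab = analyze([(ab, ("C", "D")), (ab, ("C", "E"))], share=True)
extract_abc = analyze([(abc, "D"), (abc, "E")], share=True)
```

Under these assumed shapes, sharing (A + B + C) uses the fewest adders but lengthens the adder chain, while sharing only (A + B) saves less area and keeps the chain as short as the unoptimized trees.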
17 Synchronization
- Extra registers are needed to synchronize the intermediate values, such that new values for A, B, C, D, E, F can be read in every clock cycle.
Figure: Calculating registers required for fastest evaluation
18 Experiment Results: Resource Utilization/Performance
Filter Implementation Using Add and Shift Method
Filter (# taps)  Slices  LUTs   FFs    Performance (Msps)
6                264     213    509    251
10               474     406    916    222
13               386     334    749    252
20               856     705    1650   250
28               1294    1145   2508   227
41               2154    1719   4161   223
61               3264    2591   6303   192
119              6009    4821   11551  203
151              7579    6098   14611  180
Filter Implementation Using Xilinx Coregen (PDA)
Filter (# taps)  Slices  LUTs   FFs    Performance (Msps)
6                524     774    1012   245
10               781     1103   1480   222
13               929     1311   1775   199
20               1191    1631   2288   199
28               1774    2544   3381   199
41               2475    3642   4748   222
61               3528    5335   6812   199
119              6484    9754   12539  205
151              8274    12525  15988  199
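The resource savings implied by the two tables can be recomputed directly from the listed numbers. The sketch below assumes the average is the mean of the per-filter percentage reductions; the slides do not say how the average was taken, so that choice and the helper name `average_reduction` are assumptions.

```python
# (taps, slices, LUTs, FFs), copied from the two tables
add_shift = [
    (6, 264, 213, 509), (10, 474, 406, 916), (13, 386, 334, 749),
    (20, 856, 705, 1650), (28, 1294, 1145, 2508), (41, 2154, 1719, 4161),
    (61, 3264, 2591, 6303), (119, 6009, 4821, 11551), (151, 7579, 6098, 14611),
]
coregen_pda = [
    (6, 524, 774, 1012), (10, 781, 1103, 1480), (13, 929, 1311, 1775),
    (20, 1191, 1631, 2288), (28, 1774, 2544, 3381), (41, 2475, 3642, 4748),
    (61, 3528, 5335, 6812), (119, 6484, 9754, 12539), (151, 8274, 12525, 15988),
]

SLICES, LUTS, FFS = 1, 2, 3  # column indices in the tuples above


def average_reduction(ours, baseline, column):
    """Mean of the per-filter percentage reductions for one resource column."""
    reductions = [
        100.0 * (b[column] - a[column]) / b[column]
        for a, b in zip(ours, baseline)
    ]
    return sum(reductions) / len(reductions)


lut_reduction = average_reduction(add_shift, coregen_pda, LUTS)
slice_reduction = average_reduction(add_shift, coregen_pda, SLICES)
ff_reduction = average_reduction(add_shift, coregen_pda, FFS)
```

Under that assumption the LUT reduction comes out at about 58.7%, and the slice and flip-flop reductions at roughly 26% each.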
19 Experiment Results: Resource Utilization
20 Experiment Results: Power Consumption
21 Creating MAC Filters Using Xilinx Coregen
22 Experiment Results: Comparison with MAC Filters Using Multiplier Blocks
                 Add Shift Method    MAC filter
Filter (# taps)  Slices  Msps        Slices  Msps
6                264     296         219     262
10               475     296         418     253
13               387     296         462     253
20               851     271         790     251
28               1303    305         886     251
41               2178    296         1660    243
61               3284    247         1947    242
119              6025    294         3581    241
151              7623    294         7631    215
23 Experiment Results: Comparison with MAC Filters Using Multiplier Blocks - Resource Utilization
24 Experiment Results: Comparison with MAC Filters Using Multiplier Blocks - Performance
25 Conclusion/Observations
- Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low area, low power and high speed implementations of FIR filters.
- Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques.
- An average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs.
- Better performance in most of the cases even though our algorithm does not optimize for performance.
- Observed up to 50% reduction in dynamic power consumption.
- Higher performance as the filter size increases.
- Critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.