Title: High Speed FIR Filter Implementation Using Add and Shift Method
1 High Speed FIR Filter Implementation Using Add
and Shift Method
- Shahnam Mirzaei, Anup Hosangadi, Ryan Kastner
- University of California, Santa Barbara
- ICCD 2006
- San Jose, California
- October 2006
UC Santa Barbara
ICCD 2006
2 Outline
- Introduction
- FIR filter implementation
- Traditional methods
- MAC (Multiply Accumulate) implementation
- DA (Distributed Arithmetic) implementation
- New method
- Add and Shift method and CSE (Common Subexpression Elimination)
- Experiments and results
- Resource utilization
- Power consumption
- Conclusion
3 Introduction
- Extensive use of FPGAs in computationally intensive applications such as DSP
- More available logic resources in current FPGAs
- Broad applications of FIR filters in multimedia and communications
- Need for efficient design methods to save area/power
- Research motivation
- Develop a more efficient implementation method for FIR filters that consumes less area at comparable performance.
- Develop a unified tool for performing redundancy elimination, scheduling and module assignment.
- Perform physically aware optimizations.
- Architecture design exploration for ASIC and FPGA implementations (Distributed Arithmetic based, adder-shifter based, multiplier-adder based).
4 FIR Filter: MAC Implementation
- L tap FIR filter
- Convolution of the latest L input samples. L is the number of coefficients h[k] of the filter, and x[n] represents the input time series.
$$y[n] = \sum_{k=0}^{L-1} h[k]\, x[n-k]$$
- Disadvantages
- Large area on FPGA due to multipliers, even though the full flexibility of general-purpose multipliers is not required
- Limited number of embedded resources such as MAC engines, multipliers, etc. in FPGAs
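As a rough illustration of the MAC form above, the following Python sketch evaluates each output sample with one multiply-accumulate per tap. The function name `fir_mac` and the choice to treat samples before the first input as zero are assumptions made for this example, not details taken from the slides.

```python
def fir_mac(h, x):
    """Direct-form FIR: y[n] = sum over k = 0..L-1 of h[k] * x[n-k]."""
    L = len(h)
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(L):
            if n - k >= 0:               # assumption: samples before x[0] are zero
                acc += h[k] * x[n - k]   # one multiply-accumulate per tap
        y.append(acc)
    return y
```

Feeding an impulse, for example `fir_mac([1, 2, 3], [1, 0, 0, 0])`, returns the coefficients followed by zeros, which is a quick sanity check on the indexing.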
5 FIR Filter: DA (Distributed Arithmetic) Implementation
- An alternative to the MAC implementation, and the most common FPGA FIR implementation because of the LUT-rich architecture of FPGAs.
$$y = \sum_{n=0}^{N-1} c[n]\, x[n]$$
- Variable x[n] can be represented by
$$x[n] = \sum_{b=0}^{B-1} x_b[n]\, 2^b, \qquad x_b[n] \in \{0, 1\}$$
- where x_b[n] is the bth bit of x[n] and B is the input width. The inner product can be rewritten as follows.
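6 FIR Filter: DA (Distributed Arithmetic) Implementation (contd)
$$
\begin{aligned}
y &= \sum_{n=0}^{N-1} c[n] \sum_{b=0}^{B-1} x_b[n]\, 2^b \\
  &= c[0]\,\bigl(x_{B-1}[0]\, 2^{B-1} + x_{B-2}[0]\, 2^{B-2} + \cdots + x_0[0]\, 2^0\bigr) \\
  &\quad + c[1]\,\bigl(x_{B-1}[1]\, 2^{B-1} + x_{B-2}[1]\, 2^{B-2} + \cdots + x_0[1]\, 2^0\bigr) \\
  &\quad + \cdots \\
  &\quad + c[N-1]\,\bigl(x_{B-1}[N-1]\, 2^{B-1} + x_{B-2}[N-1]\, 2^{B-2} + \cdots + x_0[N-1]\, 2^0\bigr) \\
  &= \bigl(c[0]\, x_{B-1}[0] + c[1]\, x_{B-1}[1] + \cdots + c[N-1]\, x_{B-1}[N-1]\bigr)\, 2^{B-1} \\
  &\quad + \bigl(c[0]\, x_{B-2}[0] + c[1]\, x_{B-2}[1] + \cdots + c[N-1]\, x_{B-2}[N-1]\bigr)\, 2^{B-2} \\
  &\quad + \cdots \\
  &\quad + \bigl(c[0]\, x_0[0] + c[1]\, x_0[1] + \cdots + c[N-1]\, x_0[N-1]\bigr)\, 2^0 \\
  &= \sum_{b=0}^{B-1} 2^b \sum_{n=0}^{N-1} c[n]\, x_b[n]
\end{aligned}
$$
- where n = 0, 1, ..., N-1 and b = 0, 1, ..., B-1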
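The rewritten inner product above can be sketched in Python as a lookup table indexed by one bit from every input sample, followed by a shift-and-accumulate step. This is only a software model under stated assumptions: the inputs are unsigned B-bit integers, and the names `build_da_lut` and `da_inner_product` are hypothetical, not taken from the slides or from any vendor tool.

```python
def build_da_lut(c):
    """LUT entry `addr` holds the sum of c[n] for every bit n that is set in addr."""
    N = len(c)
    return [sum(c[n] for n in range(N) if (addr >> n) & 1)
            for addr in range(1 << N)]

def da_inner_product(c, x, B):
    """Compute sum_n c[n] * x[n] for unsigned B-bit x[n], one bit plane at a time."""
    lut = build_da_lut(c)
    acc = 0
    for b in range(B):
        # The LUT address is bit b of every sample: x_b[0], ..., x_b[N-1].
        addr = 0
        for n, sample in enumerate(x):
            addr |= ((sample >> b) & 1) << n
        acc += lut[addr] << b   # scaling accumulator: weight the lookup by 2^b
    return acc
```

No multiplier appears in the loop: each of the B iterations performs one table lookup, one shift and one addition, which is why this form maps well onto LUT-based logic.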
7 DA (Distributed Arithmetic) Implementation: Serial
A Serial DA Filter Block Diagram
- n+1 clock cycles are needed for an n-bit input symmetrical filter to generate the output.
- Performance is limited by the fact that the next input sample can be processed only after every bit of the current input sample has been processed.
- The tradeoff here is performance for area.
8 DA (Distributed Arithmetic) Implementation: Parallel
- The performance of the circuit can be improved by modifying the architecture to a parallel architecture, which processes the data bits in groups.
- Increasing the number of bits sampled has a significant effect on resource utilization on FPGA.
- More LUTs
- Larger scaling accumulator
A 2-bit Parallel DA Filter Block Diagram
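To make the area/performance trade-off concrete, here is a small Python model that processes several bit planes per cycle. It assumes unsigned B-bit inputs and models each bit position within a group as its own copy of the lookup table. The function name `da_parallel` and the cycle counter are illustrative assumptions and do not describe the actual block diagram.

```python
def da_parallel(c, x, B, bits_per_cycle=2):
    """Model of parallel DA: returns (inner product, number of cycles used)."""
    N = len(c)
    lut = [sum(c[n] for n in range(N) if (addr >> n) & 1)
           for addr in range(1 << N)]
    acc = 0
    cycles = 0
    for base in range(0, B, bits_per_cycle):
        partial = 0
        for j in range(bits_per_cycle):      # each j stands for a separate LUT copy
            b = base + j
            if b >= B:
                break
            addr = 0
            for n, sample in enumerate(x):
                addr |= ((sample >> b) & 1) << n
            partial += lut[addr] << j        # combine the lookups within the group
        acc += partial << base               # wider scaling accumulator
        cycles += 1
    return acc, cycles
```

With 4-bit inputs, `bits_per_cycle=1` takes four cycles in this model and `bits_per_cycle=2` takes two, at the cost of a second lookup per cycle.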
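9 CSE (Common Subexpression Elimination)
- Linear systems can be modeled using polynomials. Expressions consist of +, -, << operators.
- Polynomial formulation, where L^i denotes a left shift by i bits:
$$C \cdot X = \sum_i X L^i$$
$$(14)_{10} \cdot X = (1110)_2 \cdot X = (X \ll 3) + (X \ll 2) + (X \ll 1) = X L^3 + X L^2 + X L^1$$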
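The polynomial view of a constant multiplication can be checked with a few lines of Python. The sketch below assumes a non-negative integer constant in plain binary (no signed-digit recoding), and the helper names `shift_terms` and `const_multiply` are made up for this example.

```python
def shift_terms(c):
    """Return the shift amounts i for which bit i of the constant c is set."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]

def const_multiply(c, x):
    """Multiply x by a non-negative constant c using only shifts and additions."""
    return sum(x << i for i in shift_terms(c))
```

For the constant 14, `shift_terms(14)` yields the shifts 1, 2 and 3, matching the three terms of the polynomial.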
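10 CSE Example
Y0 = X0 + X1 + X2 + X3
Y1 = 2X0 + X1 - X2 - 2X3
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - 2X1 + 2X2 - X3

In matrix form:
$$
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{bmatrix}
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
$$

In polynomial form (L denotes a left shift by one bit):
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3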
11 CSE Example
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3

Extracting D0 = X0 + X3:
Y0 = D0 + X1 + X2
Y1 = X0L + X1 - X2 - X3L
Y2 = D0 - X1 - X2
Y3 = X0 - X1L + X2L - X3
12 CSE Example
Extracting D1 = X1 - X2 and D2 = X1 + X2 as well:
Y0 = D0 + D2
Y1 = X0L + D1 - X3L
Y2 = D0 - D2
Y3 = X0 - D1L - X3
13 CSE Example
12 additions, 4 shifts
Y0 = X0 + X1 + X2 + X3
Y1 = X0L + X1 - X2 - X3L
Y2 = X0 - X1 - X2 + X3
Y3 = X0 - X1L + X2L - X3

D0 = X0 + X3
D1 = X1 - X2
D2 = X1 + X2
D3 = X0 - X3
Y0 = D0 + D2
Y1 = D1 + D3L
Y2 = D0 - D2
Y3 = D3 - D1L
8 additions, 2 shifts
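The equivalence of the two forms of this example can be checked numerically. The Python sketch below writes out both versions, with L modeled as a left shift by one bit. The function names are hypothetical, and the D terms follow the signs implied by the coefficient matrix of the example.

```python
def transform_direct(x0, x1, x2, x3):
    """Original expressions: 12 additions/subtractions and 4 shifts."""
    y0 = x0 + x1 + x2 + x3
    y1 = (x0 << 1) + x1 - x2 - (x3 << 1)
    y2 = x0 - x1 - x2 + x3
    y3 = x0 - (x1 << 1) + (x2 << 1) - x3
    return y0, y1, y2, y3

def transform_cse(x0, x1, x2, x3):
    """After common subexpression elimination: 8 additions/subtractions, 2 shifts."""
    d0 = x0 + x3
    d1 = x1 - x2
    d2 = x1 + x2
    d3 = x0 - x3
    y0 = d0 + d2
    y1 = d1 + (d3 << 1)
    y2 = d0 - d2
    y3 = d3 - (d1 << 1)
    return y0, y1, y2, y3
```

Both functions return the same four outputs for any integer inputs. The second one reuses each D term in two outputs, which is where the saved operations come from.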
14 FIR Filter Add/Shift Implementation: Replacing Constant Multiplication by Multiplier Block
15 FIR Filter Add/Shift Implementation: Registered Adder at no Additional Cost
16 Extracting Common Subexpressions
F1 = A + B + C + D
F2 = A + B + C + E
Optimization
Extracting Common Expression (A + B + C)
Unoptimized Expression Trees
Extracting Common Expression (A + B)
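One way to compare the extraction choices named on this slide is to count two-input adders. The Python sketch below does this for sums of plain terms. The counting model (a sum of k terms costs k - 1 adders, and the shared sum is computed once) and the function name `adders_with_common` are assumptions for illustration, not the authors' cost function.

```python
def adders_with_common(f1, f2, common):
    """Two-input adders needed when the sum of `common` is computed once and shared."""
    assert set(common) <= set(f1) and set(common) <= set(f2)
    shared = max(len(common) - 1, 0)   # adders inside the shared subexpression

    def remaining(f):
        # Terms left in this output: its private terms plus the shared result, if any.
        terms = len(f) - len(common) + (1 if common else 0)
        return terms - 1

    return shared + remaining(f1) + remaining(f2)

f1 = ["A", "B", "C", "D"]
f2 = ["A", "B", "C", "E"]
unoptimized = adders_with_common(f1, f2, [])                 # 6 adders
share_ab = adders_with_common(f1, f2, ["A", "B"])            # 5 adders
share_abc = adders_with_common(f1, f2, ["A", "B", "C"])      # 4 adders
```

Under this model the larger shared expression saves more adders. The count alone says nothing about how deep the resulting adder tree is, which matters once timing is considered.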
17 Synchronization
- Extra registers are needed to synchronize the intermediate values, such that new values for A, B, C, D, E, F can be read in every clock cycle.
Calculating registers required for fastest evaluation
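A common way to work out such synchronization registers is to assign every adder a pipeline stage and delay any operand that arrives early. The Python sketch below assumes every adder output is registered and every primary input is available at stage 0. The network used in the usage lines (T = A + B, U = T + C, F1 = U + D, F2 = U + E) and the name `balance_registers` are hypothetical and are not taken from the slide's figure.

```python
def balance_registers(adders):
    """adders maps each adder's output name to its two operand names, listed so
    that operands are defined before use. Names that are not keys are primary
    inputs (stage 0). Returns (stage of each adder, extra delay registers per edge)."""
    stage = {}
    extra = {}
    for out, (a, b) in adders.items():
        stage_a = stage.get(a, 0)
        stage_b = stage.get(b, 0)
        stage[out] = 1 + max(stage_a, stage_b)   # registered adder output
        for operand, s in ((a, stage_a), (b, stage_b)):
            delay = stage[out] - 1 - s           # cycles the operand must wait
            if delay > 0:
                extra[(operand, out)] = delay
    return stage, extra

network = {"T": ("A", "B"), "U": ("T", "C"), "F1": ("U", "D"), "F2": ("U", "E")}
stages, delays = balance_registers(network)
```

In this hypothetical network, C needs one delay register before it meets T, and D and E each need two before they meet U, so that a new set of inputs can be accepted every clock cycle.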
18 Experiment Results: Resource Utilization/Performance
Filter Implementation Using Add and Shift Method
Filter Implementation Using Xilinx Coregen (PDA)
19 Experiment Results: Resource Utilization
20 Experiment Results: Power Consumption
21 Creating MAC Filters Using Xilinx Coregen
22 Experiment Results: Comparison with MAC Filters Using Multiplier Blocks
23 Experiment Results: Comparison with MAC Filters Using Multiplier Blocks - Resource Utilization
24 Experiment Results: Comparison with MAC Filters Using Multiplier Blocks - Performance
25 Conclusion/Observations
- Presented a multiplierless technique, based on the add and shift method and common subexpression elimination, for low area, low power and high speed implementations of FIR filters.
- Validated our techniques on Virtex II/IV devices, where we observed significant area and power reductions over traditional Distributed Arithmetic based techniques.
- An average reduction of 58.7% in the number of LUTs, and about 25% reduction in the number of slices and FFs.
- Better performance in most of the cases, even though our algorithm does not optimize for performance.
- Observed up to 50% reduction in dynamic power consumption.
- Higher performance as the filter size increases.
- Critical path in our design consists of adders, while in the MAC method the critical path consists of multipliers and adders.