Title: Digital Filtering In Hardware
1Digital Filtering In Hardware
2Introduction
- Digital filtering vs. analog filtering
  - More robust (process variations, temperature), flexible (bit precision, programmability), can store and recover data
  - Lower performance (especially at high frequencies), more area/power, cannot sense, needs data converters
- Can perform digital filtering in hardware or software
  - Software (DSP/generic microprocessors): flexible, less up-front cost
  - Hardware (ASIC/FPGA): customized, cheaper in volume, lower area/power
3Applications
- Applications: noise filtering, equalization, image processing, seismology, radar, ECC, audio/image compression
- Focus: implementing difference equations
- No feedback: FIR; feedback: IIR
- Assume coefficient synthesis is done
- Operate almost exclusively in the time domain (FFT done)
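As a minimal sketch of the two kinds of difference equation, the Python below evaluates a generic FIR (no feedback) and IIR (feedback) filter sample by sample. The function names `fir` and `iir`, the coefficient lists, and the sign convention on the feedback terms are illustrative assumptions, not taken from the slides.

```python
def fir(x, b):
    """FIR: y(n) = sum_k b[k] * x(n-k). The output depends on inputs only."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:          # samples before n = 0 are taken as zero
                acc += bk * x[n - k]
        y.append(acc)
    return y


def iir(x, b, a):
    """IIR: y(n) = sum_k b[k] * x(n-k) + sum_{k>=1} a[k-1] * y(n-k).

    The second sum feeds past outputs back into the computation.
    """
    y = []
    for n in range(len(x)):
        acc = sum(bk * x[n - k] for k, bk in enumerate(b) if n - k >= 0)
        acc += sum(ak * y[n - k] for k, ak in enumerate(a, start=1) if n - k >= 0)
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
fir_response = fir(impulse, [1.0, 2.0, 3.0])   # dies out after 3 taps
iir_response = iir(impulse, [1.0], [0.5])      # decays but never reaches zero
```

The impulse responses make the naming concrete: the FIR response is finite (it is just the coefficient list), while the IIR response keeps halving indefinitely.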
4Evolution
5Various Representations
- 3-tap FIR
- Non-terminating: repeatedly execute the same code
- Iteration: execute all operations once; iteration period: time to perform one iteration; iteration rate: inverse of the iteration period
- Sampling rate (aka throughput): number of samples per second; critical path: maximum combinational delay (no wave pipelining!)
- Block diagram
  - Close to actual hardware: interconnected functional blocks, potentially with delay elements between blocks
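The non-terminating, one-iteration-per-sample view can be sketched as a Python generator. The slide names only a "3-tap FIR", so the form y(n) = a x(n) + b x(n-1) + c x(n-2) and the names `fir3_stream`, `d1`, and `d2` are assumptions made for illustration.

```python
import itertools


def fir3_stream(samples, a, b, c):
    """Assumed 3-tap FIR: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    d1 = d2 = 0.0                 # delay elements holding x(n-1) and x(n-2)
    for x in samples:             # runs for as long as samples keep arriving
        # One pass through the loop body is one iteration:
        # every multiply and add executes exactly once.
        yield a * x + b * d1 + c * d2
        d2, d1 = d1, x            # clock edge: shift the delay line


# itertools.count(1) is an endless input stream 1, 2, 3, ...
first_five = list(itertools.islice(fir3_stream(itertools.count(1), 1.0, 1.0, 1.0), 5))
```

The iteration period is the time one pass through the loop body takes, and the two state variables play the role of the delay elements drawn between blocks in a block diagram.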
6Block Diagram
7Block Diagram
8Signal Flow Graph
- Unique source and sink (input and output)
- Edges represent constant multipliers and delays
- Nodes represent I/O, adders, multipliers
- Useful for analyzing wordlength effects, less so for architecture design
9SFG
10Dataflow Graph
- DFG
  - Nodes: computations (functions, subtasks)
  - Edges: datapaths
- Captures the data-driven nature of DSP, with intra-iteration and inter-iteration constraints
- Very general: nonlinear, multirate, asynchronous, synchronous
- Difference from block diagram
  - Hardware is not allocated or scheduled in a DFG
11DFG
12DFG
13Multirate DFG
14Iteration Bound
- In a DFG, each node executes once per iteration
- Iteration: all nodes executed
- Critical path: combinational path with maximum total execution time (note: we're reserving the term "delay" for sequential delay)
- Loop (cycle): path beginning and ending at the same node
- Loop bound for loop L: T_L / W_L, where T_L is the total execution time of the nodes in L and W_L is the number of delays in L
- Iteration bound: maximum of all loop bounds
- Lower bound on the execution time per iteration of the DFG (assuming only pipelining, retiming, unfolding)
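The definitions above can be sketched directly: enumerate the loops of a small DFG, compute T_L / W_L for each, and take the maximum. The graph below (nodes `A`, `B`, `C` and their execution times) is a made-up example, and brute-force loop enumeration is used only because it mirrors the definition; it is not meant as an efficient algorithm.

```python
from fractions import Fraction


def iteration_bound(exec_time, edges):
    """exec_time: {node: time units}; edges: list of (src, dst, num_delays)."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
    best = Fraction(0)

    def dfs(start, node, time, delays, visited):
        nonlocal best
        for nxt, w in adj.get(node, []):
            if nxt == start:                       # closed a loop
                total = delays + w
                if total == 0:
                    raise ValueError("delay-free loop: DFG is not computable")
                best = max(best, Fraction(time, total))   # loop bound T_L / W_L
            elif nxt not in visited and nxt > start:
                # Only visit nodes ordered after `start`, so each simple
                # loop is found once, from its smallest node label.
                dfs(start, nxt, time + exec_time[nxt], delays + w, visited | {nxt})

    for s in sorted(exec_time):
        dfs(s, s, exec_time[s], 0, {s})
    return best


# Hypothetical DFG: loop A->B->A has T=6, W=2; loop A->B->C->A has T=10, W=3.
times = {"A": 2, "B": 4, "C": 4}
graph = [("A", "B", 0), ("B", "A", 2), ("B", "C", 0), ("C", "A", 3)]
bound = iteration_bound(times, graph)
```

Here the two loop bounds are 3 and 10/3, so the iteration bound is 10/3 time units. `Fraction` keeps the result exact, which matters because iteration bounds are often nonintegral.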
15Iteration Bound
16Iteration Bound
17Iteration Bound
182.3
192.4
202.5
212.6
222.7
23Pipeline and Parallelize
- Pipelining: insert delay elements to reduce critical path length
  - Faster (more throughput), lower power
  - Added latency, latches/clocking
- Parallelism: compute multiple outputs in a single clock cycle
  - Faster, lower power
  - Added hardware, sequencing logic
24Pipelining
- General: applicable to microprocessor architectures, logic circuits, DFGs
- Have to place delays (flops) carefully
  - On feed-forward cutsets
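A small cycle-level sketch of why the placement matters: putting one register on every edge of a feed-forward cutset leaves the function unchanged apart from one cycle of extra latency. The 3-tap direct-form FIR, the choice of cutset, and the names `p_sum` and `p_tap` are assumptions chosen for illustration.

```python
def fir3(xs, a, b, c):
    """Unpipelined direct form: y(n) = a*x(n) + b*x(n-1) + c*x(n-2)."""
    d1 = d2 = 0
    out = []
    for x in xs:
        out.append(a * x + b * d1 + c * d2)   # multiply, then two adds in series
        d2, d1 = d1, x
    return out


def fir3_pipelined(xs, a, b, c):
    """Same filter with registers on a feed-forward cutset.

    The cutset crosses two edges: first adder -> second adder, and the
    delay-line edge feeding the c tap. Each gets one pipeline register.
    """
    d1 = d2 = 0
    p_sum = 0     # register on the adder-to-adder edge
    p_tap = 0     # register on the delay-line edge
    out = []
    for x in xs:
        out.append(p_sum + c * p_tap)   # second stage: one multiply, one add
        p_sum = a * x + b * d1          # first stage: one multiply, one add
        p_tap = d2
        d2, d1 = d1, x
    return out


xs = [1, 2, 3, 4, 5]
reference = fir3(xs, 1, 2, 3)
pipelined = fir3_pipelined(xs, 1, 2, 3)
```

The pipelined output equals the reference delayed by one sample. In this sketch the longest combinational path drops from one multiply plus two adds to one multiply plus one add. Registering only one of the two cutset edges would misalign the data and change the filter.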
25Pipelining
26Pipelining Parallel
27Pipelining
28Feed-forward Cutset
29Transposition
30Transposition
31Data Broadcast
32Fine-grain Pipelining
33Parallel Processing
- Process blocks of L samples at a time
- Clock period = L × sample period (i.e., L / sample rate)
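Block processing can be sketched as follows: each pass of the outer loop stands for one clock cycle, in which L outputs are computed side by side. The 3-tap FIR, the default block size L = 3, and the function names are assumptions for illustration.

```python
def fir3_serial(xs, a, b, c):
    """Reference: one output per clock cycle."""
    def x(n):
        return xs[n] if 0 <= n < len(xs) else 0
    return [a * x(n) + b * x(n - 1) + c * x(n - 2) for n in range(len(xs))]


def fir3_parallel(xs, a, b, c, L=3):
    """L-parallel version: one block of L outputs per clock cycle."""
    def x(n):
        return xs[n] if 0 <= n < len(xs) else 0
    out = []
    for k in range(0, len(xs), L):          # one clock cycle per block
        # In hardware these L expressions are L copies of the filter
        # datapath, all evaluated in the same cycle.
        block = [a * x(k + i) + b * x(k + i - 1) + c * x(k + i - 2)
                 for i in range(L)]
        out.extend(block)
    return out[:len(xs)]                    # drop padding in the last block


samples = [1, 2, 3, 4, 5, 6, 7]
serial_out = fir3_serial(samples, 1, 2, 3)
parallel_out = fir3_parallel(samples, 1, 2, 3)
clock_cycles = -(-len(samples) // 3)        # ceil(7 / 3) cycles instead of 7
```

Both versions produce the same samples. The parallel one needs only one clock cycle per L samples, so for a fixed sample rate the clock can run L times slower, at the cost of roughly L copies of the arithmetic.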
34Parallelism
35Parallelism
36Components
37Need for Parallelism
38Parallelism
- Why not use pipelining?
  - May have a single large delay element that cannot be divided (communication between chips)
- Can use in conjunction with pipelining
- Relatively less efficient than pipelining (area cost and power savings)
- Note that we've skirted the issue of parallelizing general DFGs
  - Loops make life hard
39Parallelize Pipeline
40Area Efficiency
41Pipelining Processors
- Classic DLX processor
  - ISA: Load/Store or Mem Access
  - 5 stages: IF, ID, EX, MEM, WB
- Pipelining processors is hard
  - Data hazards
    - ADD r1, r2, r3; SUB r4, r5, r1
    - Solution: use bypass logic
    - LD r1, r2; ADD r4, r1, r2
    - Solution?
  - Branch hazards
    - PC not changed till end of ID
    - Solution: redo IF (only) if branch taken
- Pipelining DFGs is easy (no control flow!)
42Pipelining Processors
43Retiming
- Basic idea (for logic circuits)
  - Move flops back and forth across gates
  - Use for clock period reduction, flop minimization, power minimization, resynthesis
- Same idea holds for DFGs
- Examples
- Algorithm
- C-slow retiming
44Retiming
45Retiming
46Cutset Retiming
47Cutset Retiming
48C-Slow Retiming
49Min Delay Retiming
- Formalize: use the notion of a retiming function on nodes
  - Amount of delay pushed back across a node (can be negative; think of it as a retardation function)
- Want to know if cycle time TC is feasible
  - Set up constraints
  - Long paths have to be broken
  - No negative delays on edges
- Solve using a custom ILP
  - Uses efficient graph algorithms
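A sketch of the two constraints, under the usual convention that a retiming value r changes the delay count on edge U -> V to w + r(V) - r(U). The four-node graph, the execution times, and the candidate retiming below are hypothetical; the code only checks a given retiming and does not search for one.

```python
def retime(edges, r):
    """Apply retiming r: new delays on U -> V are w + r[V] - r[U]."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]


def is_legal(edges):
    """Constraint: no negative delays on edges."""
    return all(w >= 0 for _, _, w in edges)


def critical_path(exec_time, edges):
    """Longest total execution time along a path with no delays.

    Assumes the graph has no delay-free loops, so the recursion terminates.
    """
    zero = {}
    for u, v, w in edges:
        if w == 0:
            zero.setdefault(u, []).append(v)
    memo = {}

    def longest_from(u):
        if u not in memo:
            tail = max((longest_from(v) for v in zero.get(u, [])), default=0)
            memo[u] = exec_time[u] + tail
        return memo[u]

    return max(longest_from(u) for u in exec_time)


# Hypothetical DFG: nodes 1 and 2 are adders (1 unit), 3 and 4 multipliers (2 units).
times = {1: 1, 2: 1, 3: 2, 4: 2}
dfg = [(1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0), (2, 1, 1)]
r = {1: 0, 2: 1, 3: 0, 4: 0}          # candidate retiming values

retimed = retime(dfg, r)
before = critical_path(times, dfg)     # multiplier followed by adder
after = critical_path(times, retimed)  # long path is broken by a delay
```

A target cycle time TC is feasible for this candidate if `is_legal(retimed)` holds and `after <= TC`. Retiming never changes the number of delays around a loop, which is why it cannot beat the iteration bound.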
50Unfolding
- Analogous to loop unrolling for programs
  - for (I = 1; I < 5; I++) a[I] = b[I] + c[I];
- Many benefits, at the price of a potential increase in code size
- Look at 2-unfolding of
  - y(n) = x(n) + a y(n-9)
- General algorithm for J-unfolding a DFG
  - Uses J nodes for each original node, new delay values
  - Nontrivial fact: the algorithm works
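The standard unfolding rule can be sketched in a few lines: each edge U -> V with w delays becomes J edges, from copy i of U to copy (i + w) mod J of V, carrying floor((i + w) / J) delays. The node names `add` and `mul` and the edge-list encoding are assumptions; the example graph is the recursive loop of y(n) = x(n) + a y(n-9).

```python
def unfold(edges, J):
    """J-unfold a DFG given as a list of (src, dst, num_delays).

    Returns edges between (node, copy_index) pairs.
    """
    out = []
    for u, v, w in edges:
        for i in range(J):
            out.append(((u, i), (v, (i + w) % J), (i + w) // J))
    return out


# Loop of y(n) = x(n) + a*y(n-9): the adder output reaches the multiplier
# through 9 delays, and the multiplier feeds the adder directly.
loop = [("add", "mul", 9), ("mul", "add", 0)]
unfolded = unfold(loop, 2)
total_delays = sum(w for _, _, w in unfolded)
```

With J = 2 the nine delays split into 4 and 5, matching y(2k) = x(2k) + a y(2k-9) and y(2k+1) = x(2k+1) + a y(2k-8). The total number of delays is preserved by the rule.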
51Unfolding
52Unfolding
53Applications
- Meet iteration bound
- When a single node has large execution time
- When IB is nonintegral
54Applications IB
55Application fractional IB
56Applications Parallelize
- Recall in Chapter 3, we never gave a systematic way of generating parallel circuits
- Loop unfolding gives a way
57Applications Bit-Digit
- Convert a bit-serial architecture to a
digit-serial architecture
58Folding
- Trade area for time
- Use the same hardware unit for multiple nodes in the DFG
- Example: y(n) = a(n) + b(n) + c(n)
- Need a general, systematic approach to folding
- Math formulation: folding orders, folding sets, folding factors
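One common way to make this systematic is the folding equation from standard VLSI DSP texts: an edge U -> V with w delays needs N*w - P_U + v - u registers in the folded design, where N is the folding factor, P_U the number of pipeline stages in the unit executing U, and u, v the folding orders. The sketch below assumes that formulation; the node names `A1` and `A2` and the single pipeline stage are hypothetical.

```python
def folded_delays(edges, N, stages, order):
    """Registers needed on each folded edge: N*w - P_U + v - u.

    edges:  list of (U, V, w) with w delays in the original DFG
    N:      folding factor (nodes time-multiplexed onto one unit)
    stages: {node: pipeline stages of the unit that executes it}
    order:  {node: folding order, 0..N-1, i.e. its time slot}
    """
    return {(u, v): N * w - stages[u] + order[v] - order[u]
            for u, v, w in edges}


# y(n) = a(n) + b(n) + c(n) needs two additions: A1 = a + b, A2 = A1 + c.
# Fold both onto one adder (N = 2) that has one pipeline stage (assumed).
dfg = [("A1", "A2", 0)]
stages = {"A1": 1, "A2": 1}

good = folded_delays(dfg, 2, stages, {"A1": 0, "A2": 1})
bad = folded_delays(dfg, 2, stages, {"A1": 1, "A2": 0})
```

With A1 scheduled first, the edge needs zero extra registers, so the folding is valid. Swapping the folding orders gives a negative value, meaning A2 would need its operand before A1 has produced it; such a DFG has to be retimed before it can be folded.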
59Folding
60Folding
61Folding
62Folding
63Folding
64Folding
65Register Minimization
- Consider a DSP program that produces 3 variables, each live during the listed time steps
  - a: 1, 2, 3, 4
  - b: 2, 3, 4, 5, 6, 7
  - c: 5, 6, 7
- Number of live variables per time step: 1, 2, 2, 2, 2, 2, 2
- Intuitively, should be able to get by with 2 registers
- However, DSP programs are periodic
  - May have variables live across iterations
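The live-variable count can be reproduced with a short sketch that also handles the periodic case by wrapping lifetimes around the iteration period. The lifetimes of a, b, and c come from the slide; the period of 7 time steps and the extra variable `d` are assumptions added to show a lifetime that crosses an iteration boundary.

```python
def live_counts(lifetimes, period):
    """Count live variables in each time slot of a periodic schedule.

    lifetimes: {name: (first, last)} inclusive, 1-based time steps
    period:    number of time steps per iteration
    """
    counts = [0] * period
    for first, last in lifetimes.values():
        for t in range(first, last + 1):
            counts[(t - 1) % period] += 1   # wrap into the next iteration
    return counts


slide_vars = {"a": (1, 4), "b": (2, 7), "c": (5, 7)}
counts = live_counts(slide_vars, 7)
registers_needed = max(counts)

# Hypothetical variable d, live from step 6 of one iteration to step 2 of the next.
with_wrap = live_counts({**slide_vars, "d": (6, 9)}, 7)
```

For the slide's three variables the maximum is 2, matching the intuition. Once a lifetime wraps around, it overlaps variables of the next iteration, and the maximum in this made-up case rises to 3.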
66Linear Lifetime Chart
67Lifetime Analysis Matrix
68Lifetime Chart Matrix
69Register Allocation Table
70Reg Assignment Matrix
71Reg Assignment Biquad
72Reg Assignment Biquad
73Pipelined Parallel IIR
- Feedback loops make pipelining and parallelism very hard
- Impossible to beat the iteration bound without rewriting the difference equation
- Example
  - Pipeline interleaving of y(n+1) = a y(n) + b u(n)
  - Note that IB goes up, but can run multiple streams in parallel
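Pipeline interleaving can be sketched as one multiply-add unit whose loop holds M delays, with M independent input streams taking turns on successive clock cycles. The FIFO model of the loop registers and the function names are assumptions for illustration.

```python
def first_order(u, a, b):
    """Reference: y(n+1) = a*y(n) + b*u(n), with y(0) = 0."""
    y, out = 0.0, []
    for un in u:
        out.append(y)
        y = a * y + b * un
    return out


def interleaved(streams, a, b):
    """Run M independent streams through one shared loop with M delays."""
    M = len(streams)
    regs = [0.0] * M                   # the M delays in the feedback loop
    outs = [[] for _ in range(M)]
    for n in range(len(streams[0])):
        for s in range(M):             # one clock cycle per stream, round robin
            y = regs.pop(0)            # oldest register holds stream s's state
            outs[s].append(y)
            regs.append(a * y + b * streams[s][n])
    return outs


inputs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
shared = interleaved(inputs, 0.5, 1.0)
separate = [first_order(u, 0.5, 1.0) for u in inputs]
```

Each stream sees exactly the same recursion as if it had the hardware to itself. The extra loop delays can be used to pipeline the multiply-add, so the clock can be faster, but any single stream still advances only once every M cycles.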
74Pipeline Interleaved IIR
75Pipeline Interleaved IIR
76Pipeline Interleaved IIR
77Pipelining 1-st Order IIR
- y(n+1) = a y(n) + u(n)
- Sample rate set by multiply and add time
- Can do better by look-ahead pipelining
  - Basically, changing the difference equation to get more delays in the loop
  - Key: functionality unchanged
- Best understood in terms of Z-transforms
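A one-step look-ahead can be checked numerically. Assuming the recursion is y(n+1) = a y(n) + u(n), substituting it into itself gives y(n+2) = a^2 y(n) + a u(n) + u(n+1), which has two delays in the loop instead of one. The function names and the integer test data are illustrative assumptions.

```python
def direct(u, a, y0=0):
    """Original recursion: y(n+1) = a*y(n) + u(n). One delay in the loop."""
    y = [y0]
    for un in u:
        y.append(a * y[-1] + un)
    return y


def look_ahead(u, a, y0=0):
    """Look-ahead form: y(n+2) = a*a*y(n) + a*u(n) + u(n+1).

    The feedback now reaches back two samples, so the loop contains two
    delays and the multiply-add inside it can be pipelined in two stages.
    """
    y = [y0, a * y0 + u[0]]            # two initial states are needed
    for n in range(len(u) - 1):
        # a*u(n) + u(n+1) is feed-forward and can be pipelined freely.
        y.append(a * a * y[n] + a * u[n] + u[n + 1])
    return y


u = [1, 2, 3, 4]
y_direct = direct(u, 3)
y_look_ahead = look_ahead(u, 3)
```

The two output sequences are identical, which is the sense in which functionality is unchanged. In Z-transform terms the same step multiplies the numerator and denominator of z^-1 / (1 - a z^-1) by (1 + a z^-1), giving the denominator 1 - a^2 z^-2.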
78Pipelining 1-st Order IIR
79Pipelining 1-st Order IIR
80Pipelining High Order IIR
- Three basic approaches
- Clustered look-ahead
- Scattered look-ahead
- Direct synthesis with constraints
81Pipelining High Order IIR
82Pipelining High Order IIR
83Pipelining High Order IIR
84Pipelining High Order IIR
85Pipelining High Order IIR
86Pipelining High Order IIR
87Pipelining High Order IIR