ECE 699: Digital Signal Processing Hardware Implementations, Lecture 11

1
ECE 699: Digital Signal Processing Hardware
Implementations, Lecture 11
  • Low-Power Design, Final Review
  • 4/22/09

2
Outline
  • Low-Power Design
  • Final Review

3
Reading
  • Low-Power Design
  • Parhi, VLSI Digital Signal Processing Systems
  • Chapter 17

4
Low-Power Design
5
Design Criteria
Source: Parhi/Owall
6
Low-Power Trends
Source: Parhi/Owall
7
Peak Power and Average Power
Source: Parhi/Owall
8
CMOS Power Consumption
Source: Parhi/Owall
9
CMOS Dynamic Power
Source: Parhi/Owall
10
System Integration
Source: Parhi/Owall
11
Switching Activity
Source: Parhi/Owall
12
Glitching
Source: Parhi/Owall
13
Clock Gating
Source: Parhi/Owall
14
Ripple Carry Glitching
Source: Parhi/Owall
15
Balancing Operations
Source: Parhi/Owall
16
Delay vs. Supply Voltage and Threshold Voltage
Source: Parhi/Owall
17
Dual Vt Technology
Source: Parhi/Owall
18
High Vt Stand-by
Source: Parhi/Owall
19
Final Exam Review
20
Lecture 6 Highlights
  • Lecture 6 began with a study of the Discrete Time
    Fourier Transform (DTFT) and continued to a
    sampled version of the DTFT, called the Discrete
    Fourier Transform (DFT)
  • The FFT was introduced as a computationally
    efficient mechanism to implement the DFT
  • Radix-2 FFT (DIT and DIF)
  • Radix-4 FFT (DIT and DIF)
  • Finally, various implementation issues were
    discussed including FFT architectures (serial,
    parallel, pipeline, etc.) and bit-level issues

21
Fast Fourier Transform
  • Can exploit shared twiddle factor properties
    (i.e. sub-expression sharing) to reduce the
    number of multiplications in DFT
  • This class of algorithms is called Fast Fourier
    Transforms (FFTs)
  • An FFT is simply an efficient implementation of
    the DFT
  • Mathematically, FFT = DFT
  • FFT exploits two properties of the twiddle
    factors $W_N = e^{-j 2\pi / N}$
  • Symmetry property: $W_N^{k+N/2} = -W_N^k$
  • Periodicity property: $W_N^{k+N} = W_N^k$
  • FFTs use a divide-and-conquer approach, breaking
    an N-point DFT into several smaller DFTs
  • N can be factored as $N = r_1 r_2 \cdots r_v$, where the $r_i$
    are prime
  • Particular focus on $r_1 = r_2 = \cdots = r_v = r$, where r is
    called the radix of the FFT algorithm
  • In this case $N = r^v$ and the FFT has a regular
    pattern
  • We will study radix-2 (r = 2) and radix-4 (r = 4)
    FFTs in this class
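The two twiddle-factor properties can be checked numerically. The sketch below assumes the usual definition W_N^k = exp(-j 2 pi k / N); the function name `twiddle` is illustrative and not from the lecture.

```python
import cmath

def twiddle(N, k):
    """Twiddle factor W_N^k = exp(-j*2*pi*k/N) (standard DFT convention)."""
    return cmath.exp(-2j * cmath.pi * k / N)

N = 8
for k in range(N):
    # Symmetry: W_N^(k + N/2) = -W_N^k
    assert abs(twiddle(N, k + N // 2) + twiddle(N, k)) < 1e-12
    # Periodicity: W_N^(k + N) = W_N^k
    assert abs(twiddle(N, k + N) - twiddle(N, k)) < 1e-12
```

Because of these two properties, an N-point DFT needs only N/2 distinct twiddle values, which is what lets the butterflies share multiplications.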

22
Decimation-in-time Radix-2 FFT
  • Split x(n) into even and odd samples and perform
    smaller FFTs
  • $f_1(n) = x(2n)$
  • $f_2(n) = x(2n+1)$
  • $n = 0, 1, \ldots, N/2 - 1$
  • Derivation performed in class
  • Radix-2 Decimation-in-time (DIT) algorithm
  • In radix-2, the "butterfly" element takes in 2
    inputs and produces 2 outputs
  • Butterfly implements 2-point FFT
  • Computations:
  • $(N/2)\log_2 N$ complex multiplications
  • $N \log_2 N$ complex additions
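As a rough illustration of the even/odd split, here is a minimal recursive radix-2 DIT sketch. It is not the in-class derivation; the function names are illustrative, and the direct `dft` is included only as a reference for checking the result.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft_dit2(x):
    """Recursive radix-2 DIT FFT; len(x) is assumed to be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    F1 = fft_dit2(x[0::2])   # DFT of even samples f1(n) = x(2n)
    F2 = fft_dit2(x[1::2])   # DFT of odd samples  f2(n) = x(2n+1)
    X = [0j] * N
    for k in range(N // 2):
        t = cmath.exp(-2j * cmath.pi * k / N) * F2[k]  # one complex multiply
        X[k] = F1[k] + t             # first butterfly output
        X[k + N // 2] = F1[k] - t    # second output, via the symmetry property
    return X

x = [1, 2, 3, 4, 0, -1, -2, -3]
assert all(abs(a - b) < 1e-9 for a, b in zip(fft_dit2(x), dft(x)))
```

Each level of recursion performs N/2 butterflies, and there are log2 N levels, which is where the (N/2) log2 N multiplication count comes from.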

23
Decimation-in-time Radix-2 FFT (N = 8)
24
Decimation-in-frequency Radix-2 FFT
  • Decompose X(k) such that it is split into FFT of
    points 0 to N/2-1 and points N/2 to N-1
  • Then decimate X(k) into even and odd numbered
    samples
  • Derivation performed in class
  • Radix-2 Decimation-in-frequency (DIF) algorithm
  • In radix-2, the "butterfly" element takes in 2
    inputs and produces 2 outputs
  • Butterfly implements 2-point FFT
  • Computations:
  • $(N/2)\log_2 N$ complex multiplications
  • $N \log_2 N$ complex additions
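A minimal recursive sketch of the DIF decomposition follows, under the same assumptions as a textbook radix-2 DIF (power-of-2 length, standard twiddle convention). The names are illustrative; the direct `dft` is only a reference.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft_dif2(x):
    """Recursive radix-2 DIF FFT; len(x) is assumed to be a power of 2."""
    N = len(x)
    if N == 1:
        return list(x)
    half = N // 2
    # Butterflies come first: combine points 0..N/2-1 with points N/2..N-1.
    a = [x[n] + x[n + half] for n in range(half)]
    b = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / N)
         for n in range(half)]
    X = [0j] * N
    X[0::2] = fft_dif2(a)   # even-numbered outputs X(2k)
    X[1::2] = fft_dif2(b)   # odd-numbered outputs  X(2k+1)
    return X

x = [1, 2, 3, 4, 0, -1, -2, -3]
assert all(abs(p - q) < 1e-9 for p, q in zip(fft_dif2(x), dft(x)))
```

Compared with DIT, the twiddle multiplication here sits after the add/subtract rather than before it; the operation counts are the same.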

25
Radix-4 FFT
  • In radix-2 you have $\log_2 N$ stages
  • Can also implement radix-4 and now have $\log_4 N$
    stages
  • Radix-4 decimation-in-time: split x(n) into four
    time sequences instead of two
  • Derivation performed in class
  • Split x(n) into four decimated sample streams
  • $f_1(n) = x(4n)$
  • $f_2(n) = x(4n+1)$
  • $f_3(n) = x(4n+2)$
  • $f_4(n) = x(4n+3)$
  • $n = 0, 1, \ldots, N/4 - 1$
  • Radix-4 Decimation-in-time (DIT) algorithm
  • In radix-4, the "butterfly" element takes in 4
    inputs and produces 4 outputs
  • Butterfly implements 4-point FFT
  • Computations:
  • $(3N/4)\log_4 N = (3N/8)\log_2 N$ complex multiplications
    (a decrease from radix-2 algorithms)
  • $(3N/2)\log_2 N$ complex additions (an increase from
    radix-2 algorithms)
  • Downside: can only handle FFT sizes that are a power
    of 4, such as N = 4, 16, 64, 256, 1024, etc.
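The four-way split can be sketched the same way as radix-2. The code below is an illustrative recursive radix-4 DIT, not the in-class derivation; it assumes N is a power of 4, and the helper names are hypothetical.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def fft_dit4(x):
    """Recursive radix-4 DIT FFT; len(x) must be a power of 4."""
    N = len(x)
    if N == 1:
        return list(x)
    if N % 4 != 0:
        raise ValueError("length must be a power of 4")
    q = N // 4
    F = [fft_dit4(x[l::4]) for l in range(4)]   # DFTs of x(4n), x(4n+1), ...
    X = [0j] * N
    for k in range(q):
        w = cmath.exp(-2j * cmath.pi * k / N)
        # Three non-trivial twiddle multiplications per butterfly.
        a, b, c, d = F[0][k], w * F[1][k], w**2 * F[2][k], w**3 * F[3][k]
        # 4-point DFT: only multiplications by 1, -1, j, -j (no real multiplies).
        X[k] = a + b + c + d
        X[k + q] = a - 1j * b - c + 1j * d
        X[k + 2 * q] = a - b + c - d
        X[k + 3 * q] = a + 1j * b - c - 1j * d
    return X

x = [float(n % 5) - 2.0 for n in range(16)]
assert all(abs(p - r) < 1e-9 for p, r in zip(fft_dit4(x), dft(x)))
```

The three twiddle multiplications per butterfly, N/4 butterflies per stage, and log4 N stages give the (3N/4) log4 N count quoted above.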

26
Parallel Implementation
  • Implement entire FFT structure in a parallel
    fashion
  • Advantages: control is easy (i.e. no controller),
    low latency (i.e. 0 cycles in this example),
    customize each twiddle factor as a multiplication
    by a constant
  • Disadvantages: huge area, routing congestion

27
Serial/In-Place FFT Implementation
  • Implement a single butterfly. Use that butterfly
    and some memory to compute entire FFT
  • Advantages: small area
  • Disadvantages: large latency, complex controller
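A software analogue of the single-butterfly approach is the iterative in-place FFT: one `butterfly` routine is reused for every operation, and the loop indices play the role of the controller. This is a sketch under the assumption of a radix-2 DIT ordering with a bit-reversed input permutation; all names are illustrative.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used only as a reference."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def butterfly(a, b, w):
    """The single radix-2 butterfly that is reused for the whole FFT."""
    t = w * b
    return a + t, a - t

def fft_in_place(mem):
    """In-place radix-2 DIT FFT over the list `mem` (length a power of 2)."""
    N = len(mem)
    bits = N.bit_length() - 1
    for i in range(N):                       # bit-reversed input ordering
        j = bit_reverse(i, bits)
        if i < j:
            mem[i], mem[j] = mem[j], mem[i]
    size = 2
    while size <= N:                         # one pass per FFT stage
        half = size // 2
        for start in range(0, N, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                i, j = start + k, start + k + half
                mem[i], mem[j] = butterfly(mem[i], mem[j], w)  # results overwrite inputs
        size *= 2
    return mem

x = [1, 2, 3, 4, 0, -1, -2, -3]
assert all(abs(p - q) < 1e-9 for p, q in zip(fft_in_place(list(x)), dft(x)))
```

The nested loops make the trade-off visible: storage is just the N-word memory, but (N/2) log2 N butterfly operations are executed one after another.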

28
Pipeline FFT
[Figure: pipeline FFT divided into Slice 1 through Slice 4]
  • Pipeline FFT is very common for communication
    systems (OFDM, DMT)
  • Implements an entire "slice" of the FFT and
    reuses hardware to perform other slices
  • Advantages: particularly good for systems in
    which x(n) comes in serially (i.e. no block
    assembly required), very fast, more area
    efficient than parallel, can be pipelined
  • Disadvantages: controller can become complicated,
    large intermediate memories may be required
    between stages, latency of N cycles (more if
    pipelining is introduced)

29
Lecture 8 Highlights
  • Lecture 8 covered CORDIC architectures,
    discussing
  • Rotations vs. pseudorotations
  • CORDIC in vectoring and rotation modes
  • CORDIC hardware architecture
  • Extension of CORDIC
  • We also discussed direct digital frequency
    synthesizers
  • Showed basic structures
  • Discussed improvements, particularly in ROM
    compression
  • Discussed potential sources of error/spurs in
    DDFS circuits

30
22.1 Rotations and Pseudorotations
Key ideas in CORDIC
COordinate Rotation DIgital Computer used this
method in the 1950s; modern electronic calculators
also use it
If we have a computationally efficient way of
rotating a vector, we can evaluate cos, sin, and
$\tan^{-1}$ functions
Rotation by an arbitrary angle is difficult, so we:
Perform pseudorotations that require simpler operations
Use special angles to synthesize the desired
angle z: $z = \alpha^{(1)} + \alpha^{(2)} + \cdots + \alpha^{(m)}$
Source: Parhami
31
22.2 Basic CORDIC Iterations
CORDIC iteration: In step i, we pseudorotate by
an angle whose tangent is $d_i 2^{-i}$ (the angle $e^{(i)}$
is fixed, only direction $d_i$ is to be picked)
$x^{(i+1)} = x^{(i)} - d_i \, y^{(i)} \, 2^{-i}$
$y^{(i+1)} = y^{(i)} + d_i \, x^{(i)} \, 2^{-i}$
$z^{(i+1)} = z^{(i)} - d_i \tan^{-1} 2^{-i} = z^{(i)} - d_i \, e^{(i)}$
Table 22.1 Value of the function $e^{(i)} = \tan^{-1} 2^{-i}$,
in degrees and radians, for $0 \le i \le 9$
i    e(i) in degrees (approximate)    e(i) in radians (precise)
0    45.0                             0.785 398 163
1    26.6                             0.463 647 609
2    14.0                             0.244 978 663
3     7.1                             0.124 354 994
4     3.6                             0.062 418 810
5     1.8                             0.031 239 833
6     0.9                             0.015 623 728
7     0.4                             0.007 812 341
8     0.2                             0.003 906 230
9     0.1                             0.001 953 123
Example: 30° angle
30.0 ≈ 45.0 − 26.6 + 14.0 − 7.1 + 3.6 + 1.8 − 0.9 + 0.4 − 0.2 + 0.1 = 30.1
Source: Parhami
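The angle table and the 30° example can be reproduced with a few lines of code. This is an illustrative sketch; the function names are not from the source, and the sign choice d_i = sign(z) is the rotation-mode rule.

```python
import math

def cordic_angles(m):
    """e(i) = arctan(2^-i) in radians, for i = 0 .. m-1."""
    return [math.atan(2.0 ** -i) for i in range(m)]

def decompose(angle_deg, m=10):
    """Greedily pick directions d_i so that sum(d_i * e(i)) approaches the angle.

    Returns the list of directions and the residual angle in degrees.
    """
    z = angle_deg
    directions = []
    for e in cordic_angles(m):
        d = 1 if z >= 0 else -1       # rotate toward zero residual
        z -= d * math.degrees(e)
        directions.append(d)
    return directions, z

for i, e in enumerate(cordic_angles(10)):
    print(f"{i}  {math.degrees(e):5.1f}  {e:.9f}")

dirs, residual = decompose(30.0)
print(dirs, residual)
```

With exact (unrounded) angles the last direction or two can differ from the hand example, which uses the one-decimal table values; in both cases the residual is no larger than the last angle in the list.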
32
Using CORDIC in Rotation Mode
$x^{(i+1)} = x^{(i)} - d_i \, y^{(i)} \, 2^{-i}$
$y^{(i+1)} = y^{(i)} + d_i \, x^{(i)} \, 2^{-i}$
$z^{(i+1)} = z^{(i)} - d_i \tan^{-1} 2^{-i} = z^{(i)} - d_i \, e^{(i)}$
$x^{(m)} = K(x \cos z - y \sin z)$
$y^{(m)} = K(y \cos z + x \sin z)$
$z^{(m)} = 0$
where K = 1.646 760 258 121 . . .
Make z converge to 0 by choosing $d_i = \mathrm{sign}(z^{(i)})$
Start with x = 1/K = 0.607 252 935 . . . and
y = 0 to find cos z and sin z
For k bits of precision in results, k CORDIC
iterations are needed, because $\tan^{-1} 2^{-i} \approx 2^{-i}$
for large i
Convergence of z to 0 is possible because each of
the angles in our list is more than half the
previous one or, equivalently, each is less than
the sum of all the angles that follow it
Domain of convergence is −99.7° ≤ z ≤ 99.7°,
where 99.7° is the sum of all the angles in our
list; the domain contains [−π/2, π/2] radians
Source: Parhami
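A floating-point sketch of rotation mode, following the three update equations directly. The function names and the default iteration count are assumptions for illustration only.

```python
import math

def cordic_gain(m):
    """Expansion factor K = prod sqrt(1 + 2^-2i) over m iterations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(m))

def cordic_rotate(x, y, z, m=40):
    """Rotation mode: drive z to 0, rotating (x, y) by the initial z (scaled by K)."""
    for i in range(m):
        d = 1 if z >= 0 else -1                       # d_i = sign(z)
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * math.atan(2.0 ** -i))
    return x, y, z

K = cordic_gain(40)
# Start with x = 1/K, y = 0 so that the outputs are cos z and sin z directly.
c, s, _ = cordic_rotate(1.0 / K, 0.0, math.pi / 6)
print(K, c, s)
```

The angle must lie inside the domain of convergence quoted above; larger angles need a separate quadrant reduction step first.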
33
Using CORDIC in Vectoring Mode
$x^{(i+1)} = x^{(i)} - d_i \, y^{(i)} \, 2^{-i}$
$y^{(i+1)} = y^{(i)} + d_i \, x^{(i)} \, 2^{-i}$
$z^{(i+1)} = z^{(i)} - d_i \tan^{-1} 2^{-i} = z^{(i)} - d_i \, e^{(i)}$
$x^{(m)} = K(x^2 + y^2)^{1/2}$
$y^{(m)} = 0$
$z^{(m)} = z + \tan^{-1}(y / x)$
where K = 1.646 760 258 121 . . .
Make y converge to 0 by choosing
$d_i = -\mathrm{sign}(x^{(i)} y^{(i)})$
Start with x = 1 and z = 0 to find $\tan^{-1} y$
For k bits of precision in results, k CORDIC
iterations are needed, because $\tan^{-1} 2^{-i} \approx 2^{-i}$
for large i
Even though the computation above always
converges, one can use the relationship
$\tan^{-1}(1/y) = \pi/2 - \tan^{-1} y$ to limit the range
of fixed-point numbers encountered
Other trig functions: tan z is obtained from sin z
and cos z via division; inverse sine and cosine
($\sin^{-1} z$ and $\cos^{-1} z$) are discussed later
Source: Parhami
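Vectoring mode uses the same datapath with a different sign rule. The sketch below is illustrative (names and iteration count are assumptions) and works in floating point.

```python
import math

def cordic_gain(m):
    """Expansion factor K = prod sqrt(1 + 2^-2i) over m iterations."""
    return math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(m))

def cordic_vector(x, y, z=0.0, m=40):
    """Vectoring mode: drive y to 0; z accumulates atan(y/x), x grows to K*|v|."""
    for i in range(m):
        d = -1 if x * y >= 0 else 1                   # d_i = -sign(x * y)
        x, y, z = (x - d * y * 2.0 ** -i,
                   y + d * x * 2.0 ** -i,
                   z - d * math.atan(2.0 ** -i))
    return x, y, z

K = cordic_gain(40)
xm, ym, zm = cordic_vector(3.0, 4.0)
print(xm / K, ym, zm)   # magnitude, residual y, angle in radians
```

Starting from x = 1 and y equal to the argument, the returned z is the arctangent of that argument, as stated on the slide.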
34
22.3 CORDIC Hardware
$x^{(i+1)} = x^{(i)} - d_i \, y^{(i)} \, 2^{-i}$
$y^{(i+1)} = y^{(i)} + d_i \, x^{(i)} \, 2^{-i}$
$z^{(i+1)} = z^{(i)} - d_i \tan^{-1} 2^{-i} = z^{(i)} - d_i \, e^{(i)}$
If very high speed is not needed (as in a
calculator), a single adder and one shifter would
suffice
Fig. 22.3 Hardware elements needed for the
CORDIC method.
Source: Parhami
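To show why only adders, shifters, and a small angle table are needed, here is a fixed-point sketch of rotation mode in which every multiplication by 2^-i is a right shift. The word format (16 fractional bits) and the iteration count are assumptions chosen for illustration, not values from the source.

```python
import math

FRAC = 16                      # fractional bits (assumed word format)
ONE = 1 << FRAC
ITERS = 16                     # number of CORDIC iterations (assumed)

# Lookup table of e(i) = atan(2^-i), stored as fixed-point integers.
ANGLES = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERS)]
K = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERS))
X0 = round(ONE / K)            # pre-scaled start value x = 1/K

def cordic_sincos_fixed(z):
    """Return (cos, sin) of fixed-point angle z using only shifts and adds."""
    x, y = X0, 0
    for i in range(ITERS):
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - ANGLES[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + ANGLES[i]
    return x, y

c, s = cordic_sincos_fixed(round(math.pi / 6 * ONE))
print(c / ONE, s / ONE)
```

In hardware the three updates per iteration can run on three adders in parallel, or be time-multiplexed onto a single adder and shifter when speed is not critical.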
35
Overview of Frequency Synthesizers
  • A frequency synthesizer is a device which
    generates many output frequencies from a single
    input reference frequency using direct, indirect,
    or digital synthesis techniques
  • Three different types of frequency synthesizers
  • Indirect Frequency Synthesizer
  • Produce an output frequency from a secondary
    oscillator frequency, usually a voltage
    controlled oscillator (VCO) phase locked to a
    primary frequency
  • Direct Frequency Synthesizer
  • Produces multiple output frequencies from a
    single frequency standard using a series of
    mixing, multiplication, division, and filtering
    stages
  • Direct Digital Frequency Synthesizer also called
    a Numerically Controlled Oscillator (NCO)
  • A digital synthesizer, as suggested by its name,
    utilizes digital circuitry to generate output
    frequencies
  • The Direct Digital Frequency Synthesizer was first
    described in a paper by J. Tierney, C. M. Rader,
    and B. Gold in 1971

Courtesy D. Wilson
36
Advantages and Disadvantages of a DDFS
  • Advantages
  • Fine Frequency Resolution (sub-Hertz)
  • Lower Power Consumption
  • Fast Switching Speed
  • Wide Tuning Bandwidth
  • Low Phase Noise
  • Continuous Phase Switching Response
  • Disadvantages
  • A DDFS generates a sinc(x) output frequency
    spectrum containing the desired output frequency
    plus harmonics which must be filtered out
  • A DDFS produces spurious frequencies or spurs
    resulting from phase word truncation and
    imperfections in the digital-to-analog converter
    (DAC)
  • A DDFS requires a digital-to-analog converter
    (DAC)
  • DACs are the greatest cause of spurs in
    high-speed and high-resolution (>10 bits, >50
    MHz) DDFS applications
  • DACs are susceptible to spurs created by clock
    feedthrough, intermodulation, and glitch energy

Courtesy D. Wilson
37
Basic DDFS Architecture
  • Structure of a DDFS is fairly simple
  • Major components are a Phase Accumulator, Phase
    to Amplitude Converter, a D/A Converter, and a
    Low Pass or Inverse Sinc Filter

Courtesy D. Wilson
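The digital part of this architecture (phase accumulator plus phase-to-amplitude converter) can be sketched in a few lines. The bit widths, the name `fcw` for the frequency control word, and the full-sine lookup table are assumptions for illustration; the DAC and the output filter are not modeled.

```python
import math

def ddfs(fcw, n_samples, acc_bits=16, lut_bits=8, amp_bits=8):
    """Generate n_samples of a sine wave from a phase accumulator and sine LUT.

    Output frequency is fcw * f_clk / 2**acc_bits.
    """
    full_scale = 2 ** (amp_bits - 1) - 1
    lut = [round(full_scale * math.sin(2 * math.pi * p / 2 ** lut_bits))
           for p in range(2 ** lut_bits)]
    mask = (1 << acc_bits) - 1
    acc = 0
    out = []
    for _ in range(n_samples):
        # Phase truncation: only the top lut_bits of the phase address the LUT.
        out.append(lut[acc >> (acc_bits - lut_bits)])
        acc = (acc + fcw) & mask      # accumulator wraps modulo 2**acc_bits
    return out

print(ddfs(2 ** 14, 5))   # one quarter of the clock rate: 4 samples per period
```

Discarding the low phase bits before the lookup is the phase truncation mentioned earlier as a source of spurs; a wider table reduces it at the cost of ROM size.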
38
Lecture 9 Highlights
  • Lecture 9 covered retiming in detail
  • We first discussed timing basics and applications
    of retiming
  • We then discussed cutset retiming and pipelining
  • Finally we discussed an algorithm used for
    retiming to reduce the clock period of a
    recursive system (i.e. have clock period meet the
    iteration bound)

39
Retiming Introduction
  • Retiming moves around registers which already
    exist in the system
  • Retiming does not alter the latency in the system
  • Retiming does not change the input/output
    characteristics
  • Retiming DOES change the critical path of the
    system and/or the number of registers in the
    system
  • Uses the primary rules

[Figure: the primary retiming rules, drawn with delay elements D]
40
Retiming Uses
  • Retiming is used:
  • 1) to decrease minimum clock period of a circuit
    (i.e. faster)
  • 2) to reduce number of registers of a circuit
    (i.e. smaller)
  • 3) for logic synthesis (not covered in class)
  • 4) for low power CMOS circuits

41
Cutset Retiming
  • Two special cases of retiming exist
  • Cutset retiming
  • Pipelining: pipelining can be considered as
    adding a number of registers at the front of the
    DFG and then doing retiming on these new
    registers
  • Cutset retiming
  • Cutset: a set of edges that can be removed from
    the graph to create 2 disconnected subgraphs
  • Cutset retiming only affects the weights of the
    edges in the cutset.
  • If the 2 disconnected subgraphs are G1 and G2, then
    cutset retiming consists of adding k delays to
    each edge from G1 to G2 and removing k delays
    from each edge from G2 to G1
  • Cutset retiming is a special case of retiming
    where each node in the subgraph G1 has the retiming
    value j and each node in the subgraph G2 has the
    retiming value j + k (j is arbitrary)
  • Remember: a retiming solution is feasible only if
    $w_r(e) \ge 0$ for all edges
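A small sketch of cutset retiming as a special case of general retiming. It uses the standard retiming relation w_r(e) = w(e) + r(V) - r(U) for an edge e from U to V (from Parhi, not spelled out on the slide); the three-node graph and all function names are hypothetical.

```python
def retime(edges, r):
    """Apply retiming values r to edges (U, V, w): w_r = w + r[V] - r[U]."""
    return [(u, v, w + r[v] - r[u]) for u, v, w in edges]

def is_feasible(edges):
    """A retimed graph is feasible only if every edge weight is nonnegative."""
    return all(w >= 0 for _, _, w in edges)

def cutset_retime(edges, g1, g2, k, j=0):
    """Give every node in G1 the value j and every node in G2 the value j + k."""
    r = {n: j for n in g1}
    r.update({n: j + k for n in g2})
    return retime(edges, r)

# Hypothetical loop A -> B -> C -> A with 1, 0, and 2 delays on its edges.
edges = [("A", "B", 1), ("B", "C", 0), ("C", "A", 2)]
moved = cutset_retime(edges, g1={"A", "B"}, g2={"C"}, k=1)
print(moved, is_feasible(moved))
```

Here the cutset is {B to C, C to A}: one delay is added on the G1-to-G2 edge and one is removed from the G2-to-G1 edge, while the edge inside G1 is untouched. Choosing k = 3 would drive the C-to-A edge negative, so that retiming is not feasible.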

42
Algorithm for Retiming for Clock Period
Minimization
  • Algorithm for retiming for clock period
    minimization
  • First construct W(U,V) and D(U,V)
  • 1) Let $M = t_{max} \cdot n$, where $t_{max}$ is the maximum
    computation time of the nodes in G and n is the
    number of nodes in G.
  • 2) Form a new graph G' which is the same as G
    except the edge weights are replaced by
    $w'(e) = M \cdot w(e) - t(U)$ for all edges e: U → V
  • 3) Solve the all-pairs shortest path problem on
    G' (using Floyd-Warshall, for example). Let $S'_{UV}$
    be the shortest path from U to V.
  • 4) If U ≠ V, then $W(U,V) = \lceil S'_{UV} / M \rceil$ and
    $D(U,V) = M \cdot W(U,V) - S'_{UV} + t(V)$. If U = V, then
    W(U,V) = 0 and D(U,V) = t(U). $\lceil \cdot \rceil$ is the
    ceiling function.
  • Use W(U,V) and D(U,V) to determine if there is a
    retiming solution that can achieve a desired
    clock period c.
  • Usually set this desired clock period equal to
    the iteration bound of the circuit.
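Steps 1 to 4 can be sketched directly. The four-node graph below is a hypothetical example (two nodes with computation time 1, two with time 2), and the function name is illustrative.

```python
def compute_W_D(nodes, t, edges):
    """Compute W(U,V) and D(U,V) for a DFG with node times t and edges (U, V, w)."""
    M = max(t.values()) * len(nodes)                  # step 1
    INF = float("inf")
    S = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    for u, v, w in edges:                             # step 2: w'(e) = M*w(e) - t(U)
        S[u][v] = min(S[u][v], M * w - t[u])
    for k in nodes:                                   # step 3: Floyd-Warshall
        for u in nodes:
            for v in nodes:
                if S[u][k] + S[k][v] < S[u][v]:
                    S[u][v] = S[u][k] + S[k][v]
    W, D = {}, {}
    for u in nodes:                                   # step 4
        for v in nodes:
            if u == v:
                W[u, v], D[u, v] = 0, t[u]
            elif S[u][v] < INF:
                W[u, v] = -(-S[u][v] // M)            # integer ceiling of S'/M
                D[u, v] = M * W[u, v] - S[u][v] + t[v]
    return W, D

nodes = [1, 2, 3, 4]
t = {1: 1, 2: 1, 3: 2, 4: 2}
edges = [(2, 1, 1), (1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0)]
W, D = compute_W_D(nodes, t, edges)
print(W[1, 4], D[3, 4])
```

W(U,V) is the minimum number of registers on any path from U to V, and D(U,V) is the largest computation time among the paths that achieve that minimum; scaling by M lets a single shortest-path run recover both.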

43
Algorithm for Retiming for Clock Period
Minimization cont'd
  • Given a desired clock period c, there is a
    feasible retiming solution r such that $\Phi(G_r) \le c$
    if the following constraints hold
  • CONSTRAINT 1 (feasibility): $r(U) - r(V) \le w(e)$
    for every edge e: U → V of G
  • This enforces the number of delays on each edge
    in the retimed graph to be nonnegative
  • CONSTRAINT 2 (critical path): $r(U) - r(V) \le W(U,V) - 1$
    for all vertices U, V in G such that $D(U,V) > c$
  • This enforces $\Phi(G_r) \le c$
  • Thus, to find a solution:
  • 1) Pick a value of c (usually equal to the iteration
    bound)
  • 2) Create a series of inequalities based on the
    feasibility constraint.
  • 3) Create a series of inequalities based on the
    critical path constraint.
  • 4) Combine these (using the most restrictive if
    overlap exists) and create a constraint graph.
  • 5) Find feasibility using a shortest-path algorithm
    (e.g. Floyd-Warshall) and find the retiming values
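The five steps can be sketched as follows. This sketch swaps in Bellman-Ford relaxation from a virtual source instead of Floyd-Warshall, since either shortest-path method works on the constraint graph. The graph, its W and D values, and all names are a hypothetical example, and w_r(e) = w(e) + r(V) - r(U) is the standard retiming relation.

```python
def solve_retiming(nodes, edges, W, D, c):
    """Return retiming values r meeting clock period c, or None if infeasible."""
    cons = {}                                   # (U, V) -> bound on r(U) - r(V)
    def add(u, v, bound):                       # keep the most restrictive bound
        cons[u, v] = min(cons.get((u, v), bound), bound)
    for u, v, w in edges:
        add(u, v, w)                            # feasibility constraint
    for (u, v), d in D.items():
        if d > c:
            add(u, v, W[u, v] - 1)              # critical path constraint
    r = {n: 0 for n in nodes}                   # distances from a virtual source
    for _ in range(len(nodes)):
        changed = False
        for (u, v), b in cons.items():          # r(U) - r(V) <= b is an edge V -> U
            if r[v] + b < r[u]:
                r[u] = r[v] + b
                changed = True
        if not changed:
            return r
    return None                                 # negative cycle: no solution

# Hypothetical 4-node DFG with precomputed W and D matrices (rows U, columns V).
nodes = [1, 2, 3, 4]
edges = [(2, 1, 1), (1, 3, 1), (1, 4, 2), (3, 2, 0), (4, 2, 0)]
Wm = [[0, 1, 1, 2], [1, 0, 2, 3], [1, 0, 0, 3], [1, 0, 2, 0]]
Dm = [[1, 4, 3, 3], [2, 1, 4, 4], [4, 3, 2, 6], [4, 3, 6, 2]]
W = {(u, v): Wm[u - 1][v - 1] for u in nodes for v in nodes}
D = {(u, v): Dm[u - 1][v - 1] for u in nodes for v in nodes}

r = solve_retiming(nodes, edges, W, D, c=2)
retimed = [(u, v, w + r[v] - r[u]) for u, v, w in edges]
print(r, retimed)
```

Adding the same constant to every r value gives an equivalent solution, so only the differences between retiming values matter.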

44
Lecture 10 Highlights
  • Lecture 10 began with a discussion of unfolding
    and its use to reduce the critical path of the
    circuit, as well as for parallel processing
  • Folding was also introduced for area
    minimization, and an algorithm was presented to
    achieve a folded structure

45
Unfolding Algorithm
Source: Parhi
46
Applications of Unfolding
Source: Parhi
47
Folding
Source: Parhi
48
Folding Transformation
Source: Parhi