Title: Techniques for Low Power Turbo Coding in Software Radio
1Techniques for Low Power Turbo Coding in Software Radio
2Software Defined Radio
- Single transmitter for many protocols
- Protocols completely specified in memory
- Implementation
- Microprocessors
- Field programmable logic
3Why Use Software Radio?
- Wireless protocols are constantly reinvented
- 5 Wi-Fi protocols
- 7 Bluetooth protocols
- Proprietary mice and keyboard protocols
- Mobile phone protocol alphabet soup
- Custom DSP logic for each protocol is costly
4So Why Not Use Software Radio?
- Requires high performance processors
- Consumes more power
5Turbo Coding
- Channel coding technique
- Throughput nears theoretical limit
- Great for bandwidth limited applications
- CDMA2000
- WiMAX
- NASA's MESSENGER probe
6Turbo Coding Considerations
- Presents a design trade-off
- Turbo coding is computationally expensive
- But it reduces cost in other areas
- Bandwidth
- Transmission power
7Reducing Power in Turbo Decoders
- FPGA turbo decoders
- Use dynamic reconfiguration
- General processor turbo decoders
- Use a logarithmic number system
8Generic Turbo Encoder
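[Figure: turbo encoder block diagram. The data stream is output directly as s; one component encoder produces p1, and a second component encoder, fed through the interleaver, produces p2.]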
9Generic Turbo Decoder
[Figure: turbo decoder block diagram. Labels: two Decoder blocks, Interleave, and signals r, q1, and q2.]
10Decoder Design Options
- Multiple algorithms used to decode
- Maximum A-Posteriori (MAP)
- Most accurate estimate possible
- Complex computations required
- Soft-Output Viterbi Algorithm
- Less accurate
- Simpler calculations
11FPGA Design Options
- Goal: make an adaptive decoder
[Figure: decoder block diagram. Labels: Decoder, Received Data, Original sequence, Parity.]
12Component Encoder
[Figure: component encoder built from two M blocks and a generator function.]
- M blocks are 1-bit registers
- Memory provides encoder state
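As a rough illustration of the encoder structure, the Python sketch below builds a component encoder from two 1-bit registers and combines two of them into the generic turbo encoder that outputs s, p1, and p2. The generator taps (feedback 1 + D + D^2, feedforward 1 + D^2) and the interleaver permutation `perm` are assumptions chosen for the example; the slides do not specify either.

```python
def rsc_encode(bits):
    """Parity stream of a recursive systematic encoder with two registers.

    Assumed generators (not from the slides): feedback 1 + D + D^2,
    feedforward 1 + D^2.
    """
    m1 = m2 = 0                      # the two "M" registers hold the state
    parity = []
    for u in bits:
        fb = u ^ m1 ^ m2             # feedback sum enters the first register
        parity.append(fb ^ m2)       # feedforward taps 1 + D^2
        m1, m2 = fb, m1              # shift the register chain
    return parity


def turbo_encode(bits, perm):
    """Return (s, p1, p2): systematic bits and the two parity streams."""
    s = list(bits)
    p1 = rsc_encode(s)
    p2 = rsc_encode([s[i] for i in perm])   # second encoder sees interleaved data
    return s, p1, p2


data = [1, 0, 1, 1, 0, 0, 1, 0]
perm = [3, 7, 0, 5, 2, 6, 1, 4]      # hypothetical fixed interleaver
s, p1, p2 = turbo_encode(data, perm)
```

Each input bit yields three output bits here, so this sketch has rate 1/3; practical encoders often puncture the parity streams to raise the rate.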
13Encoder State
[Figure: trellis diagram of the encoder states 00, 01, 10, and 11 over time, with transitions labeled 0 and 1 and driven by the generator function (GF).]
14Viterbi's Algorithm
- Determine most likely output
- Simulate encoder state given received values
15Viterbi's Algorithm
- Write: compute branch metric (likelihood)
- Traceback: compute path metric, output data
- Update: compute distance between paths
- Rank paths by path metric and choose best
- For N memory bits
- Must calculate 2N-1 paths for each state
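The idea of simulating the encoder state against the received values can be sketched in Python as follows. This is a minimal hard-decision sketch under stated assumptions, not the decoder from the slides: the two-register recursive encoder (feedback 1 + D + D^2, feedforward 1 + D^2), the Hamming-distance branch metric, and the names `step`, `encode`, and `viterbi` are all illustrative choices. For brevity it stores the decoded bits with each survivor instead of running a separate traceback phase.

```python
def step(state, u):
    """One trellis transition of an assumed 2-register recursive encoder.

    state is (m1, m2); returns (next_state, (systematic bit, parity bit)).
    """
    m1, m2 = state
    fb = u ^ m1 ^ m2                 # feedback sum entering the first register
    return (fb, m1), (u, fb ^ m2)


def encode(bits):
    """Run the encoder from the all-zero state and collect output symbols."""
    state, out = (0, 0), []
    for u in bits:
        state, sym = step(state, u)
        out.append(sym)
    return out


def viterbi(received):
    """Keep, per state, the path with the smallest Hamming distance so far."""
    survivors = {(0, 0): (0, [])}    # state -> (path metric, decoded bits)
    for r in received:
        nxt = {}
        for state, (metric, bits) in survivors.items():
            for u in (0, 1):
                ns, sym = step(state, u)
                branch = (sym[0] != r[0]) + (sym[1] != r[1])  # branch metric
                cand = (metric + branch, bits + [u])
                if ns not in nxt or cand[0] < nxt[ns][0]:     # add, compare, select
                    nxt[ns] = cand
        survivors = nxt
    # Choose the best surviving path at the end of the block.
    return min(survivors.values(), key=lambda mb: mb[0])[1]


msg = [1, 0, 1, 1, 0, 0, 1, 0]
rx = encode(msg)
rx[2] = (rx[2][0], rx[2][1] ^ 1)     # corrupt one parity bit
decoded = viterbi(rx)
```

With a single flipped bit early in the block, the lowest-metric survivor is the transmitted message again. A soft-output variant would additionally track the metric difference between competing paths, which is the "Update" step listed above.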
16Adaptive SOVA
- SOVA: inflexible path system scales poorly
- Adaptive SOVA: heuristic
- Limit to M paths max
- Discard if path metric below threshold T
- Discard all but top M paths when too many paths
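The pruning heuristic can be sketched in Python as below. This is an illustration only: it assumes that a higher path metric means a more likely path and that T is an absolute threshold, neither of which the slides state, and the function name `prune_paths` is hypothetical.

```python
import heapq


def prune_paths(paths, threshold, max_paths):
    """Apply the two discard rules to a list of (metric, bits) candidates.

    Assumes a higher metric means a more likely path.
    """
    # Rule 1: discard any path whose metric is below the threshold T.
    kept = [p for p in paths if p[0] >= threshold]
    # Rule 2: if too many paths remain, keep only the top M by metric.
    if len(kept) > max_paths:
        kept = heapq.nlargest(max_paths, kept, key=lambda p: p[0])
    return kept


candidates = [(0.9, [1, 0]), (0.2, [0, 0]), (0.7, [1, 1]), (0.6, [0, 1])]
survivors = prune_paths(candidates, threshold=0.5, max_paths=2)
```

Lowering `max_paths` shrinks the work per trellis step, which is the knob the dynamic reconfiguration scheme adjusts.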
17Implementing in Hardware
[Figure: hardware block diagram with Control, Branch Metric Unit, Add Compare Select, and Survivor memory blocks, and signals r and q.]
18Implementing in Hardware
- Add, Compare, Select
- Append path metric
- Discard paths
- Survivor Memory
- Store / discard path bits
- Controller
- Control memory
- Select paths
- Branch Metric Unit
- Compute likelihood
- Consider all possible next states
19Implementing in Hardware
- Add, Compare, Select Unit
[Figure: Add, Compare, Select unit datapath. Labels: Present State Path Values, Branch Values, Compute/Compare Paths, Path Distance, Threshold (> T), Next State Path Values.]
20Dynamic Reconfiguration
- Bit Error Rate (BER)
- Changes with signal strength
- Changes with number of paths used
- Change hardware at runtime
- Weak signal: use many paths, preserve accuracy
- Strong signal: use few paths, save power
- Sample SNR every 250k bits and reconfigure
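The reconfiguration policy could be modeled in software roughly as follows. Only the 250k-bit sampling interval comes from the slide; the SNR thresholds, the path counts, and the names `select_max_paths` and `reconfiguration_events` are hypothetical placeholders for illustration.

```python
SAMPLE_INTERVAL_BITS = 250_000       # sampling interval from the slide

# Hypothetical (minimum SNR in dB, max paths) pairs; values are illustrative only.
CONFIGS = [(6.0, 2), (3.0, 4), (0.0, 8)]
DEFAULT_PATHS = 16                   # assumed setting for the weakest signals


def select_max_paths(snr_db):
    """Strong signal: few paths (less power). Weak signal: many paths."""
    for min_snr, paths in CONFIGS:
        if snr_db >= min_snr:
            return paths
    return DEFAULT_PATHS


def reconfiguration_events(snr_samples):
    """Yield (bit offset, max paths) whenever the configuration changes."""
    current = None
    for i, snr in enumerate(snr_samples):
        paths = select_max_paths(snr)
        if paths != current:         # reconfigure only when the choice changes
            current = paths
            yield i * SAMPLE_INTERVAL_BITS, paths


events = list(reconfiguration_events([7.0, 6.5, 2.0, -1.0]))
```

Reconfiguring only when the selected path count changes avoids paying the reconfiguration cost at every sample.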
21Dynamic Reconfiguration
22Experimental Results
K (number of encoder bits) is proportional to average speed and power
23Experimental Results
- FPGA decoding has a much higher throughput
- Due to parallelism
24Experimental Results
- ASOVA performs worse than commercial cores
- However, in other metrics it is much better
- Power
- Memory usage
- Complexity
25Future Work
- Use present reconfiguration means to design
- Partial reconfiguration
- Dynamic voltage scaling
- Compare to power efficient software methods
26Power-Efficient Implementation of a Turbo Decoder in SDR System
- Turbo coding systems are built using one of three general processor types
- Fixed Point (FXP)
- Cheapest, simplest to implement, fastest
- Floating Point (FLP)
- More precision than fixed point
- Logarithmic Numbering System (LNS)
- Simplifies complex operations
- Complicates simple add/subtract operations
27Logarithmic Numbering System
- $X \rightarrow \{s, x = \log_b |X|\}$
- s = sign bit; the remaining bits store the number's value
- Example
- Let b = 2
- Then the decimal number 8 is represented as $\log_2 8 = 3$
- Numbers are stored in computer memory in two's complement form (3 = 00000011) (sign bit = 0)
28Why use Logarithmic System?
- Greatly simplifies multiplication, division, roots, and exponents
- Multiplication simplifies to addition
- E.g. 8 × 4 = 32; LNS => 3 + 2 = 5
- ($2^5 = 32$)
- Division simplifies to subtraction
- E.g. 8 / 4 = 2; LNS => 3 - 2 = 1
- ($2^1 = 2$)
29Why use Logarithmic System?
- Roots are done as right shifts
- E.g. sqrt(16) = 4
- LNS => 4 shifted right by one bit = 2
- ($2^2 = 4$)
- Exponents are done as left shifts
- E.g. $8^2 = 64$; LNS => 3 shifted left by one bit = 6
- ($2^6 = 64$)
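The four simplifications can be checked with a small Python sketch. It assumes base 2 and stores the exponent as a fixed-point integer with `FRAC_BITS` fractional bits; that width and the function names are illustrative choices, not details from the slides.

```python
import math

FRAC_BITS = 8   # assumed fixed-point precision of the stored exponent


def to_lns(value):
    """Encode a nonzero value as (sign bit, fixed-point log2 of its magnitude)."""
    sign = 0 if value > 0 else 1
    return sign, round(math.log2(abs(value)) * (1 << FRAC_BITS))


def from_lns(num):
    """Decode (sign bit, fixed-point exponent) back to a float."""
    sign, x = num
    magnitude = 2.0 ** (x / (1 << FRAC_BITS))
    return -magnitude if sign else magnitude


def lns_mul(a, b):
    return a[0] ^ b[0], a[1] + b[1]      # multiply: add the exponents


def lns_div(a, b):
    return a[0] ^ b[0], a[1] - b[1]      # divide: subtract the exponents


def lns_sqrt(a):
    return 0, a[1] >> 1                  # square root of a positive value: shift right by one


def lns_square(a):
    return 0, a[1] << 1                  # square: shift left by one


product = from_lns(lns_mul(to_lns(8), to_lns(4)))
quotient = from_lns(lns_div(to_lns(8), to_lns(4)))
root = from_lns(lns_sqrt(to_lns(16)))
square = from_lns(lns_square(to_lns(8)))
```

The four results reproduce the slide examples (32, 2, 4, and 64); each operation on the stored exponent is a single integer add, subtract, or shift.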
30So why not use LNS for all processors?
- Unfortunately, addition and subtraction are greatly complicated in LNS.
- Addition: $\log_b(X + Y) = x + \log_b(1 + b^z)$
- Subtraction: $\log_b(X - Y) = x + \log_b(1 - b^z)$
- Where $z = y - x$, with $x = \log_b X$ and $y = \log_b Y$
- Turbo coding/decoding is computationally intense, requiring more multiplications, divisions, roots, and exponents than additions or subtractions
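The addition and subtraction identities can be verified numerically with the short Python sketch below. It assumes base 2 and positive operands, and it evaluates the correction term with floating-point math where hardware would typically use a lookup table; the function names are illustrative.

```python
import math


def lns_add(x, y):
    """log2(X + Y) given x = log2(X) and y = log2(Y), for positive X and Y."""
    z = y - x
    return x + math.log2(1 + 2.0 ** z)


def lns_sub(x, y):
    """log2(X - Y); requires X > Y > 0 so that 1 - 2^z stays positive."""
    z = y - x
    return x + math.log2(1 - 2.0 ** z)


sum_log = lns_add(3, 2)      # X = 8, Y = 4, so X + Y = 12
diff_log = lns_sub(3, 2)     # X - Y = 4, whose log2 is 2
```

Unlike multiplication, each addition needs a subtraction to form z, an evaluation of the nonlinear correction term, and a final add.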
31Turbo Decoder block diagram
- Each bit decision requires a subtraction, a table lookup, and an addition
32Proposed new block diagram
- As the difference between ea and eb becomes larger, the error between the value stored in the lookup table and the computed value becomes negligible.
- For this simulation, a difference of > 5 was used
33How it works
- For d > 5
- New mux (on right) ignores the SRAM input and simply adds 0 to the MAX result.
- For d > 5, pre-decoder circuitry disables the SRAM to conserve power.
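A software analogue of this bypass is sketched below in Python. It assumes the Max operation is the Jacobian logarithm, ln(e^a + e^b) = max(a, b) + ln(1 + e^(-d)) with d = |a - b|, in base e; the slides do not state the base or the table layout, so `STEP`, `TABLE`, and `max_star` are hypothetical. Only the cutoff of 5 comes from the slides.

```python
import math

THRESHOLD = 5          # from the slides: differences greater than 5 skip the table
STEP = 0.125           # assumed table resolution
# Assumed correction table holding ln(1 + e^-d) for d in [0, THRESHOLD].
TABLE = [math.log1p(math.exp(-i * STEP)) for i in range(int(THRESHOLD / STEP) + 1)]


def max_star(a, b):
    """Approximate ln(e^a + e^b) as max(a, b) plus a correction term."""
    d = abs(a - b)                          # subtraction
    if d > THRESHOLD:
        correction = 0.0                    # bypass: add 0, the table is not read
    else:
        correction = TABLE[round(d / STEP)] # table lookup
    return max(a, b) + correction           # addition


bypassed = max_star(1.0, 9.0)
looked_up = max_star(2.0, 2.0)
```

Under these assumptions the skipped correction is at most ln(1 + e^-5), about 0.0067, which is why dropping the lookup costs little accuracy.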
34Comparing the 3 simulations
- Comparisons were done between a 16-bit fixed point microcontroller, a 16-bit floating point processor, and a 20-bit LNS processor.
- 11 bits would be sufficient for FXP and FLP, but 16-bit processors are much more common
- Similarly, 17 bits would suffice for the LNS processor, but 20-bit is a common type
35Power Consumption
36Latency
37Power savings
- Pre-decoder circuitry adds 11.4% power consumption compared to an SRAM read.
- So when an SRAM read is required, we use 111.4% of the power compared to the unmodified system
- However, when the SRAM is blocked we only use 11.4% of the power we used before.
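The two cases above combine into an average that depends on how often the SRAM read is skipped. The Python sketch below works that out; the 11.4% overhead is from the slide, while `skip_fraction` (the share of Max operations with d > 5) is a free parameter the slides do not report.

```python
PREDECODER_OVERHEAD = 0.114   # from the slide, relative to one SRAM read


def relative_max_power(skip_fraction):
    """Average Max-operation power relative to the unmodified system.

    skip_fraction: assumed share of Max operations where the SRAM is blocked.
    """
    with_read = 1.0 + PREDECODER_OVERHEAD    # 111.4% when the SRAM is read
    without_read = PREDECODER_OVERHEAD       # 11.4% when the SRAM is blocked
    return skip_fraction * without_read + (1 - skip_fraction) * with_read


always_read = relative_max_power(0.0)
never_read = relative_max_power(1.0)
half_skipped = relative_max_power(0.5)
```

The expression simplifies to 1.114 minus `skip_fraction`, so in this model the modification only pays off when more than 11.4% of lookups are skipped.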
38Power savings
- The CACTI simulations for the system reported that the Max operation accounted for 40% of all operations in the decoder
- The Max operations for the modified system required 69% of the power when compared to the unmodified system.
- This leads to an overall power savings of
- 69% × 40% = 27.6%
39Conclusion
- Turbo codes are computationally intense, requiring more complex operations than simple ones
- LNS processors simplify complex operations at the expense of making adding and subtracting more difficult
40Conclusion
- Using an LNS processor with slight modifications can reduce power consumption by 27.6%
- Overall latency is also reduced due to the ease of complex operations in an LNS processor when compared to FXP or FLP processors.