1
DSP architectures for wireless communications
  • Sridhar Rajagopal
  • Department of Electrical and Computer Engineering
  • Rice University, Houston TX
  • ECE Pizza Talk
  • March 28, 2003

This work has been supported in part by Nokia,
TI, TATP and NSF
2
Future wireless devices
  • High data rate mobile devices with multimedia
  • Multiple antennas w/ complex algorithms, GOPs of
    computation
  • Area-Time-Power constraints
  • Seamless connection across environments and
    standards
  • Use the fastest and cheapest available service

3
Aim of the talk
4
Trends
FLEXIBILITY
5
Change in flexibility requirements
No change (already flexible)
Maximum change (needs to support multiple
environments, algorithms and standards)
6
Architecture trade-offs
  • Past: more DSP, less ASIC. Current: less DSP,
    more ASIC
  • Reason: need less flexibility, OR DSPs not
    powerful enough?
  • Can't we build better DSPs?
  • How much flexibility do we need?

7
Problems with current DSPs
  • Current DSPs
  • Not enough functional units (FUs) for GOPs of
    computation
  • Need 100s of FUs
  • Not low power enough!!
  • Cannot extend to more FUs
  • Limited Instruction Level Parallelism (ILP)
  • Limited Subword Parallelism (such as MMX)
  • Cannot support more registers (area, ports)
  • Difficult for compilers to find ILP as FUs increase

8
Scalable Wireless Application-specific Processors
(SWAPs)
  • Exploit data parallelism (DP)
  • Available in many wireless algorithms
  • This is what ASICs do!!
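  • Example:
  • int i, a[1024], b[1024], c[1024];      // 32 bits
  • short int d[1024], e[1024], f[1024];   // 16 bits, packed
  • for (i = 1; i < 1024; i++) {
  •     a[i] = b[i] + c[i];   // "+" is assumed: the operator is missing in the transcript
  •     d[i] = e[i] + f[i];   // same assumption
  • }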

DP
ILP
Subword
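
The three labels on this slide can be illustrated with a short C sketch. It is only an illustration under assumptions, not code from the talk: the function names `add_i32` and `add_packed16` are hypothetical, and addition is assumed as the operation. The first function is a loop whose iterations are independent (data parallelism); the second adds two 16-bit values packed in one 32-bit word with a single pass over the word (subword parallelism). Calls to the two functions do not depend on each other, which is the kind of independence an ILP scheduler could exploit.

```c
#include <stddef.h>
#include <stdint.h>

/* Data parallelism: no iteration depends on another, so the
 * iterations could be spread across clusters. */
void add_i32(int32_t *a, const int32_t *b, const int32_t *c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}

/* Subword parallelism: x and y each hold two 16-bit lanes.
 * Each lane is added modulo 2^16; a carry out of the low lane
 * is masked off so it never reaches the high lane. */
uint32_t add_packed16(uint32_t x, uint32_t y) {
    uint32_t lo = (x + y) & 0x0000FFFFu;
    uint32_t hi = ((x & 0xFFFF0000u) + (y & 0xFFFF0000u)) & 0xFFFF0000u;
    return hi | lo;
}
```

For example, `add_packed16(0x0001FFFF, 0x00010001)` adds the lanes (1, 0xFFFF) and (1, 1) and yields the lanes (2, 0), because the low lane wraps around without disturbing the high lane.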
9
SWAPs: stream processors for wireless
  • Kernels (computation) and streams (communication)
  • Operations in kernels use local data
  • Streams expose data parallelism
  • Imagine stream processor at Stanford
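
The kernel/stream split can be sketched in plain C. This is a rough illustration of the programming model only, not code for SWAPs or Imagine: the names `stream_t`, `kernel_fn`, `kernel_scale` and `run_kernel` are hypothetical, and the gain of 3 is an arbitrary value chosen for the example.

```c
#include <stddef.h>

/* A stream: a sequence of records passed between kernels
 * (the communication part). */
typedef struct {
    const int *data;
    size_t     len;
} stream_t;

/* A kernel: computation on one record that touches only local data. */
typedef int (*kernel_fn)(int record);

/* Example kernel (hypothetical): scale each record by a local constant. */
int kernel_scale(int record) {
    int gain = 3;   /* local to the kernel */
    return gain * record;
}

/* Apply a kernel to every record of the input stream. The iterations
 * are independent, which is the data parallelism the stream exposes. */
void run_kernel(kernel_fn k, stream_t in, int *out) {
    for (size_t i = 0; i < in.len; i++) {
        out[i] = k(in.data[i]);
    }
}
```

Because the kernel reads nothing but its own record and local constant, the loop in `run_kernel` could in principle be split across any number of identical clusters without changing the result.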

10
DSP vs. SWAPs
Stream Register File (SRF)
SWAPs (max. clusters; all clusters do the same
operations)
DSP (1 cluster)
11
Arithmetic clusters
  • FUs (+, *, /)
  • Scratch-pad (Sp)
  • Indexed accesses
  • Comm. unit (CU)
  • Intercluster comm.
  • Distributed reg. files
  • More FUs

[Figure: arithmetic cluster. Labels: From/To SRF, local register file, SRF, cross point, FUs, scratch-pad (Sp), intercluster network, CU]
12
SWAPs vs. DSPs trade-offs
  • Same internal memory size as DSPs
  • Dependent on application, not architecture
  • Needs more area to support more functional units
  • Area is less of a constraint than power
  • Varying levels of DP in applications
  • Needs reconfiguration!!
  • Need to turn off unused clusters (and FUs)
  • More parallelism → lower clock frequency → lower
    voltage
  • → low power (∝ CV²f + leakage) in spite of
    larger area
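
The last two bullets can be written out with a first-order CMOS power model. This is a textbook approximation under stated assumptions, not a result from the talk: the activity factor α, the cluster count N, and the premise that throughput is held constant are introduced here for illustration.

```latex
% Dynamic plus leakage power of one design point
P \approx \alpha C V^{2} f + P_{\text{leak}}

% N clusters at constant throughput: N times the switched capacitance,
% each cluster clocked N times slower, supply lowered from V_1 to V_N
P_{N} \approx \alpha (N C_{1}) V_{N}^{2} \frac{f_{1}}{N} + P_{\text{leak},N}
      = \alpha C_{1} V_{N}^{2} f_{1} + P_{\text{leak},N}

% Ratio of the dynamic terms
\frac{P_{N}^{\text{dyn}}}{P_{1}^{\text{dyn}}} \approx \left(\frac{V_{N}}{V_{1}}\right)^{2}
```

Under these assumptions the capacitance and frequency factors cancel, so the dynamic saving comes entirely from the lower supply voltage that the slower clock permits, while the leakage term grows with the larger area.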

13
Design methodology
[Flow diagram. Boxes: chain of receiver algorithms; low complexity, parallel, fixed point; flexibility-performance tradeoffs; high level language implementation; architecture exploration; FPGA, customized, reconfigurable, heterogeneous designs; ASIC design; modular programmable architecture design; DSP, SWAPs, H-SWAPs. The label "learn" appears twice.]
14
Physical layer of wireless receivers
Receiver more complex than transmitter
15
Algorithms for
  • Multiple antenna systems (MIMO systems)
  • Complexity exponential in the number of transmit
    and receive antennas
  • Wide range of extremely complex algorithms
  • Optimal depends on fading, mobility, bandwidth,
    antennas
  • GOPs of computations
  • Estimation: linear MMSE, blind, conjugate
    gradient.
  • Detection: FFT, (blind) interference
    cancellation.
  • Decoding: Viterbi, Turbo, LDPC.
  • Implement ALL of them AND the NEXT one in line
  • Use the best for the situation
  • Example for concept demonstration: Viterbi
    decoding

16
Parallel Viterbi Decoding
  • 1. Add-Compare-Select (ACS): trellis
    interconnect
  • Parallelism depends on constraint length
    (number of states)
  • 2. Conventional traceback
  • Sequential (no DP)
  • Difficult to implement in a parallel architecture
  • Use Register Exchange (RE)
  • Parallel solution
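
The two steps can be sketched in C as a small hard-decision decoder. This is a minimal sketch under assumptions, not the implementation from the talk: constraint length 3, rate 1/2, generator polynomials 7 and 5 (octal), a Hamming-distance branch metric, and at most 32 decoded bits are all choices made here for brevity, and every function name is hypothetical. The inner loop over `ns` is the ACS step and is independent per state. The survivor update is register exchange: each state copies the winning predecessor's bit register and appends one bit, so no sequential traceback is needed.

```c
#include <limits.h>
#include <stdint.h>

#define K       3                /* constraint length (assumed) */
#define NSTATES (1 << (K - 1))   /* 4 trellis states */
#define G0      7u               /* generators 7 and 5 octal (assumed) */
#define G1      5u

static unsigned parity(unsigned x) {
    unsigned p = 0;
    while (x) { p ^= x & 1u; x >>= 1; }
    return p;
}

/* Two encoder output bits for input bit b leaving state s. */
static unsigned branch_output(unsigned s, unsigned b) {
    unsigned reg = (b << (K - 1)) | s;   /* newest bit on top */
    return (parity(reg & G0) << 1) | parity(reg & G1);
}

static int hamming2(unsigned x) {
    return (int)((x & 1u) + ((x >> 1) & 1u));
}

/* Rate-1/2 encoder: one 2-bit symbol per input bit, start state 0. */
void conv_encode(const uint8_t *bits, int n, uint8_t *sym) {
    unsigned s = 0;
    for (int t = 0; t < n; t++) {
        sym[t] = (uint8_t)branch_output(s, bits[t]);
        s = (((unsigned)bits[t] << (K - 1)) | s) >> 1;
    }
}

/* Decode n symbols (n <= 32). Bit t of the message ends up at
 * position n-1-t of the returned register. */
uint32_t viterbi_re(const uint8_t *sym, int n) {
    int      metric[NSTATES], new_metric[NSTATES];
    uint32_t surv[NSTATES] = {0}, new_surv[NSTATES];

    for (int s = 0; s < NSTATES; s++)
        metric[s] = (s == 0) ? 0 : INT_MAX / 2;   /* encoder starts in state 0 */

    for (int t = 0; t < n; t++) {
        for (unsigned ns = 0; ns < NSTATES; ns++) {    /* ACS, independent per state */
            unsigned b  = ns >> (K - 2);               /* input bit that leads to ns */
            unsigned p0 = (ns << 1) & (NSTATES - 1);   /* the two predecessor states */
            unsigned p1 = p0 | 1u;
            int m0 = metric[p0] + hamming2(branch_output(p0, b) ^ sym[t]);   /* add */
            int m1 = metric[p1] + hamming2(branch_output(p1, b) ^ sym[t]);
            unsigned win = (m1 < m0) ? p1 : p0;                              /* compare */
            new_metric[ns] = (m1 < m0) ? m1 : m0;                            /* select */
            new_surv[ns]   = (surv[win] << 1) | b;     /* register exchange */
        }
        for (int s = 0; s < NSTATES; s++) {
            metric[s] = new_metric[s];
            surv[s]   = new_surv[s];
        }
    }

    int best = 0;
    for (int s = 1; s < NSTATES; s++)
        if (metric[s] < metric[best]) best = s;
    return surv[best];
}
```

Encoding the eight bits 1,0,1,1,0,0,1,0 with `conv_encode` and passing the symbols to `viterbi_re` returns the same bits packed into one register. The cost of register exchange is visible in the code: every state carries a full survivor register, and all of them are rewritten at every step.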

17
Re-ordering for parallel Viterbi
  • Exploiting Viterbi DP in SWAPs
  • Re-order ACS, RE
  • Overhead

18
SWAP Algorithms Architecture
  • Algorithm design for parallelism
  • Architecture design?

19
SWAP design
  • Decide how many clusters
  • Exploit DP
  • Decide what to put within each cluster
  • Maximize ILP with high functional unit efficiency
  • Search design space with explore tool
  • See how it meets time-area-power constraints

20
Inside a SWAP cluster: EXPLORE
Auto-exploration of adders and multipliers for
ACS
(Adder FU, Multiplier FU)
21
Explore tool benefits
  • Instruction count vs. functional unit efficiency
  • What goes inside each cluster
  • Explore all algorithms
  • Turn off functional units not in use for given
    kernel
  • Design customized application-specific units
  • Better performance with increased FU utilization
  • Algorithm 1: 3 adders, 3 multipliers, 32
    clusters
  • Algorithm 2: 4 adders, 1 multiplier, 64 clusters
  • Architecture: 4 adders, 3 multipliers, 64 clusters
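
One way to read the three configurations on this slide is that the architecture takes, for each resource, the maximum over the per-algorithm choices. The talk does not say how the explore tool combines them, so the sketch below is an assumption that merely reproduces the slide's numbers; `swap_config_t` and `merge_configs` are hypothetical names.

```c
/* Resources chosen for one algorithm, or for the whole architecture. */
typedef struct {
    int adders;        /* adders per cluster */
    int multipliers;   /* multipliers per cluster */
    int clusters;      /* number of clusters */
} swap_config_t;

static int max_int(int a, int b) { return a > b ? a : b; }

/* Assumed rule: the architecture must cover the largest demand
 * for each resource over all n algorithms. */
swap_config_t merge_configs(const swap_config_t *algs, int n) {
    swap_config_t arch = {0, 0, 0};
    for (int i = 0; i < n; i++) {
        arch.adders      = max_int(arch.adders, algs[i].adders);
        arch.multipliers = max_int(arch.multipliers, algs[i].multipliers);
        arch.clusters    = max_int(arch.clusters, algs[i].clusters);
    }
    return arch;
}
```

With the slide's inputs, {3, 3, 32} and {4, 1, 64}, this gives {4, 3, 64}. The difference between the merged architecture and each algorithm's own configuration is what could be turned off while that algorithm runs.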

22
Viterbi reconfiguration
DP
Can be turned OFF
Packet 1: constraint length 7 (16 clusters)
Packet 2: constraint length 9 (64 clusters)
Packet 3: constraint length 5 (4 clusters)
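
The cluster counts for the three packets are consistent with a trellis of 2^(K-1) states and four states per cluster (64/16, 256/64 and 16/4). The talk does not state that ratio, so it is an inference from the numbers above, and the function names below are hypothetical.

```c
/* Number of trellis states for constraint length K (K >= 1). */
unsigned viterbi_states(unsigned K) {
    return 1u << (K - 1);
}

/* Clusters to keep active for a packet. states_per_cluster = 4 matches
 * the slide's numbers but is an inferred value, not a stated one. */
unsigned active_clusters(unsigned K, unsigned states_per_cluster,
                         unsigned max_clusters) {
    unsigned c = viterbi_states(K) / states_per_cluster;
    if (c < 1u) c = 1u;                  /* always keep one cluster on */
    return c > max_clusters ? max_clusters : c;
}
```

On a 64-cluster machine, `active_clusters(7, 4, 64)`, `active_clusters(9, 4, 64)` and `active_clusters(5, 4, 64)` give 16, 64 and 4, matching packets 1 to 3; the remaining clusters are the ones that can be turned off.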
23
Viterbi decoding: rate 1/2 at 128 Kbps, 10 MHz
[Plot: frequency needed to attain real-time (in MHz, axis ticks 1 to 1000) vs. number of clusters (axis ticks 1 to 100). Curves for K = 5, K = 7 and K = 9; labels: DSP, static architecture, SWAPs]
Ideal C64x (w/o co-proc) needs 200 MHz for
real-time
24
SWAPs Salient features
  • 1-2 orders of magnitude better than a
    single-processor DSP
  • Any constraint length → 10 MHz at 128 Kbps
  • Same code for all constraint lengths
  • No need to re-compile or load another code
  • As long as parallelism/cluster ratio is constant
  • Power savings due to dynamic cluster scaling

25
Expected SWAP power consumption
  • 64 clusters and 1 multiplier per cluster
  • 0.13 micron, 1.2 V
  • Peak active power: 9 mW at 1 MHz
  • Area: 53.7 mm²
  • 10 MHz, 128 Kbps with reconfiguration

Brucek Khailany et al., "Exploring the VLSI Scalability of Stream
Processors," Proceedings of the Ninth Symposium on High Performance
Computer Architecture, February 8-12, 2003, Anaheim, California,
USA, pp. 153-164
26
Flexibility vs. performance
  • Suitable for mobile devices?
  • SWAPs: real-time at 10-100 mW
  • Maybe, but can we do better?
  • ASICs: real-time at 10-100 µW
  • No special customization for the application
  • No application-specific units
  • Generic inter-cluster communication network
  • Overhead for extracting parallelism
  • SWAPs suitable for base-stations?
  • Why not? Power is not a primary constraint!

27
Multiuser Estimation-Detection-Decoding
Real-time target: 128 Kbps per user
Ideal C64x (w/o co-proc) needs 15 GHz for
real-time
28
Current research
  • SWAPs: completely flexible and general
  • How do we trade off flexibility for better
    performance?
  • Handset SWAPs (H-SWAPs)

29
H-SWAPs: potential advantages
[Chart: execution time for DSP (RE), H-SWAPs and SWAPs]
30
Conclusions
  • Need flexible architectures for future wireless
    devices
  • Higher data rates, lower power, more complex
    algorithms
  • Design methodology (SWAPs, H-SWAPs, ASICs)
  • Flexibility vs. performance trade-offs
  • Blurs distinction between ASICs and programmable
    solutions
  • Also need parallel, low precision algorithms for
    efficient mapping
  • Inter-disciplinary research
  • Computer architecture, VLSI, wireless
    communications, computer arithmetic, compilers