1
Programmable processors for wireless base-stations
  • Sridhar Rajagopal
  • (sridhar@rice.edu)
  • December 16, 2003

2
Wireless rates vs. clock rates
[Chart: wireless data rates grew from 9.6 Kbps to 1 Mbps, 2-10 Mbps, and 54-100 Mbps, while processor clock rates grew from 200 MHz to 4 GHz]
  • Need to process 100X more bits per clock cycle
    today than in 1996

3
Base-stations need horsepower
  • Sophisticated signal processing for multiple users
  • 100-1000s of arithmetic operations needed to process 1 bit
  • Base-stations require > 100 ALUs
4
Programmable architectures
  • Wireless algorithm kernels
  • Well known, ASIC mapping well-studied
  • Processors getting more powerful every year
  • Historic trend: ASICs → programmable processors
  • Can we design a fully programmable wireless
    system?

5
Thesis addresses the following problem
  • Design programmable processors for wireless
    base-stations with 100s of ALUs
  • (a) map wireless algorithms on these processors
  • (b) make them power-efficient (adapt resources to needs)
  • (c) decide ALUs, clock frequency

How programmable? As programmable as possible.
6
Choice Multi-processors
  • Single processors won't do
  • ILP, subword parallelism not sufficient
  • Register file explosion with increasing ALUs
  • Multiprocessors
  • Data parallelism in wireless systems
  • Data-parallel/SIMD/vector processors appropriate
  • Exploit ILP, MMX, DP

7
Thesis contributions
  • (a) Mapping algorithms on data-parallel processors
  • designing data-parallel algorithms
  • tradeoffs between packing, ALU utilization and
    memory
  • reduced inter-cluster communication network
  • (b) Improve power efficiency
  • adapting compute resources to workload variations
  • varying voltage and frequency to real-time
    requirements
  • (c) Design exploration between ALUs and clock
    frequency to minimize power consumption
  • fast real-time performance prediction

8
Outline
  • Background
  • Wireless systems
  • Data-parallel (Stream) processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

9
Wireless workloads
System                        | 2G                     | 3G                         | 4G
Users                         | 32                     | 32                         | 32
Data rates                    | 16 Kbps/user           | 128 Kbps/user              | 1 Mbps/user
Algorithms: estimation        | Single-user correlator | Multi-user max. likelihood | MIMO chip equalizer
Algorithms: detection         | Matched filter         | Interference cancellation  | Matched filter
Algorithms: decoding          | Viterbi                | Viterbi                    | LDPC
Theoretical min ALUs @ 1 GHz  | > 2                    | > 20                       | > 200
Time                          | 1996                   | 2003                       | ?
10
Key kernels studied for wireless
  • FFT (media processing)
  • QRD (media processing)
  • Outer product updates
  • Matrix-vector operations
  • Matrix-matrix operations
  • Matrix transpose
  • Viterbi decoding
  • LDPC decoding (in progress)

11
Characteristics of wireless
  • Compute-bound
  • Finite precision
  • Limited temporal data reuse
  • Streaming data
  • Data parallelism
  • Static, deterministic, regular workloads
  • Limited control flow

12
Parallelism levels in wireless
    int i, a[N], b[N], sum[N];      // 32 bits
    short int c[N], d[N], diff[N];  // 16 bits, packed
    for (i = 0; i < 1024; i++) {
        sum[i]  = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }
  • Instruction Level Parallelism (ILP) - DSP
  • Subword Parallelism (MMX) - DSP
  • Data Parallelism (DP) Vector Processor
  • DP can be decreased by increasing ILP and MMX
  • Example: loop unrolling (sketched below)

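A minimal C sketch (illustrative, not from the thesis) of how unrolling the loop above converts part of the data parallelism into instruction-level parallelism, while the 16-bit arrays remain candidates for packed (MMX-style) subword operations:

    #define N 1024

    int   a[N], b[N], sum[N];   /* 32-bit data            */
    short c[N], d[N], diff[N];  /* 16-bit (packable) data */

    void kernel_unrolled(void)
    {
        for (int i = 0; i < N; i += 4) {
            /* four independent adds per loop body -> ILP */
            sum[i]     = a[i]     + b[i];
            sum[i + 1] = a[i + 1] + b[i + 1];
            sum[i + 2] = a[i + 2] + b[i + 2];
            sum[i + 3] = a[i + 3] + b[i + 3];
            /* 16-bit subtractions can be packed two per 32-bit ALU op (MMX) */
            diff[i]     = c[i]     - d[i];
            diff[i + 1] = c[i + 1] - d[i + 1];
            diff[i + 2] = c[i + 2] - d[i + 2];
            diff[i + 3] = c[i + 3] - d[i + 3];
        }
        /* whatever parallelism is left across iterations remains as DP */
    }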
13
Stream processors: multi-cluster DSPs
[Block diagram: memory and a Stream Register File (SRF) feed identical arithmetic clusters; ILP and MMX are exploited within each cluster, DP across the clusters; a single cluster corresponds to a VLIW DSP]
  • Adapt clusters to DP: identical clusters, same operations
  • Power down unused FUs and clusters
14
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Reduced inter-cluster communication network
  • Power efficiency
  • Design exploration
  • Broad impact and future work

15
Patterns in inter-cluster communication
  • The inter-cluster communication network is fully connected
  • Structure in access patterns can be exploited
  • Broadcasting
  • Matrix-vector multiplication, matrix-matrix
    multiplication, outer product updates
  • Odd-even grouping
  • Transpose, Packing, Viterbi decoding

16
Viterbi needs odd-even grouping
  • Exploiting DP in Viterbi decoding
  • Odd-even grouping of trellis states

17
Performance of Viterbi decoding
An ideal C64x DSP (without co-processor) needs 200 MHz for real-time
18
Odd-even grouping
  • Packing
  • If odd-even data packed in same cluster and
    precision doubles
  • Odd-even grouping required to bring data to the right cluster
  • Not always beneficial for performance
  • Matrix transpose
  • Better done in ALUs than in memory
  • Shown to have an order-of-magnitude better
    performance
  • Done in ALUs as repeated odd-even groupings

19
Transpose uses odd-even grouping
20
Odd-even grouping
0 1 2 3 4 5 6 7 → 0 2 4 6 1 3 5 7 (sketched in the code below)
Inter-cluster communication spans the entire chip length, limiting clock frequency and limiting scaling.
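A minimal C sketch of the odd-even grouping permutation shown above (function and buffer names are illustrative, not from the thesis):

    #include <string.h>

    /* Odd-even grouping: [0 1 2 3 4 5 6 7] -> [0 2 4 6 1 3 5 7].
     * Elements at even positions are gathered into the first half,
     * elements at odd positions into the second half.  Repeating this
     * grouping on sub-blocks is one way to realize a matrix transpose
     * in the ALUs, as described on the earlier slide. */
    static void odd_even_group(int *data, int n, int *scratch)
    {
        int half = n / 2;
        for (int i = 0; i < half; i++) {
            scratch[i]        = data[2 * i];      /* evens to lower half */
            scratch[half + i] = data[2 * i + 1];  /* odds to upper half  */
        }
        memcpy(data, scratch, n * sizeof(int));
    }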
21
A reduced inter-cluster comm network
Only nearest-neighbor interconnections
22
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

23
Flexibility needed in workloads
Billions of computations per second are needed. The workload varies from 1 GOPS for 4 users with constraint-length-7 Viterbi to 23 GOPS for 32 users with constraint-length-9 Viterbi.
24
DP changes with users
25
Data is not in the right banks
  • 4 → 2 clusters
  • Data not in the right SRF banks
  • Overhead in bringing data to the right banks
  • Via memory
  • Via inter-cluster communication network

26
Adapting clusters to Data Parallelism
[Diagram: SRF banks connect to clusters (C) through an adaptive multiplexer network; configurations shown: no reconfiguration, 4:2 reconfiguration, 4:1 reconfiguration, all clusters off]
Unused clusters are turned off using voltage gating to eliminate static and dynamic power dissipation.
27
Cluster utilization variation
[Plot: cluster utilization vs. cluster index]
Cluster utilization variation on a 32-cluster processor; (32, 9) = 32 users, constraint-length-9 Viterbi.
28
Frequency variation
29
Operation
  • Dynamic voltage-frequency scaling when the system changes significantly
  • users, data rates
  • coarse time scale (when the system changes)
  • Turn off clusters
  • when parallelism changes
  • finer time scale (once every 1000 cycles; di/dt effects)
  • memory operations
  • exceed real-time requirements
  • (a control-loop sketch follows below)
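A hypothetical control-loop sketch of the two time scales described above; every function name and threshold here is illustrative, not the thesis implementation:

    /* All functions, names, and intervals below are assumptions for
     * illustration only. */
    struct workload { int users; int data_rate_kbps; };

    extern struct workload read_workload(void);              /* assumed HW/OS hook */
    extern void set_voltage_frequency(int users, int rate);  /* coarse scale: DVFS */
    extern void gate_clusters(int active_clusters);          /* fine scale: on/off */
    extern int  required_dp(struct workload w);              /* data parallelism   */

    void adaptation_loop(void)
    {
        struct workload prev = read_workload();
        for (;;) {
            struct workload cur = read_workload();
            /* Coarse time scale: users or data rates changed -> rescale V and f. */
            if (cur.users != prev.users || cur.data_rate_kbps != prev.data_rate_kbps)
                set_voltage_frequency(cur.users, cur.data_rate_kbps);
            /* Finer time scale (e.g. roughly every 1000 cycles, limited by
             * di/dt effects): match active clusters to the available DP. */
            gate_clusters(required_dp(cur));
            prev = cur;
            /* wait ~1000 cycles before re-evaluating (mechanism omitted) */
        }
    }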

30
Power: voltage gating and scaling
Power can change from 12.38 W to 300 mW (40x
savings) depending on workload changes
31
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

32
Deciding ALUs vs. clock frequency
  • The variables are not independent
  • clusters, ALUs (adders, multipliers), frequency, voltage: (c, a, m, f)
  • Trade-offs exist
  • How to find the right combination for the lowest power?

33
Static design exploration
Also helps in quickly predicting real-time performance.
34
Sensitivity analysis important
  • We have a capacitance model [Khailany 2003]
  • Not all equations are exact
  • Need to see how variations affect the solutions

35
Design exploration methodology
  • 3 types of parallelism ILP, MMX, DP
  • For best performance (power)
  • Maximize the use of all
  • Maximize ILP and MMX at expense of DP
  • Loop unrolling, packing
  • Schedule on sufficient number of
    adders/multipliers
  • If DP remains, set clusters = DP
  • No other way to exploit that parallelism

36
Setting clusters, adders, multipliers
  • If sufficient DP, linear decrease in frequency
    with clusters
  • Set clusters depending on DP and execution time
    estimate
  • To find adders and multipliers,
  • Let compiler schedule algorithm workloads across
    different numbers of adders and multipliers and
    let it find execution time
  • Put all numbers in power equation
  • Compare increase in capacitance due to added ALUs
    and clusters with benefits in execution time
  • Choose the solution that minimizes power (see the sketch after this list)
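A simplified sketch of this exploration loop, assuming a dynamic-power model of the form P ≈ C(c, a, m) · V² · f; the scheduling, capacitance, and voltage functions are placeholders standing in for the compiler schedule and the capacitance model, not the actual thesis tools:

    #include <float.h>

    extern double exec_cycles(int clusters, int adders, int mults); /* from compiler schedule  */
    extern double cap_model(int clusters, int adders, int mults);   /* switched capacitance    */
    extern double voltage_at(double freq_hz);                       /* voltage needed for f    */

    struct config { int clusters, adders, mults; double power; };

    struct config explore(double realtime_seconds, int dp)
    {
        struct config best = { 0, 0, 0, DBL_MAX };
        int c = dp;  /* set clusters to the available DP (simplification) */
        for (int a = 1; a <= 4; a++) {
            for (int m = 1; m <= 4; m++) {
                /* frequency needed to finish the workload in real time */
                double f = exec_cycles(c, a, m) / realtime_seconds;
                double v = voltage_at(f);
                double p = cap_model(c, a, m) * v * v * f;  /* P ~ C*V^2*f */
                if (p < best.power)
                    best = (struct config){ c, a, m, p };
            }
        }
        return best;  /* configuration with the lowest estimated power */
    }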

37
Design exploration for clusters (c)
[Plot: DP vs. execution time for each algorithm workload]
  • For sufficiently large numbers of adders and multipliers per cluster:
  • Algorithm 1: 32 clusters
  • Algorithm 2: 64 clusters
  • Algorithm 3: 64 clusters
  • Algorithm 4: 16 clusters

38
Clusters frequency and power
32 clusters at frequency 836.692 MHz (p = 1)
64 clusters at frequency 543.444 MHz (p = 2)
64 clusters at frequency 543.444 MHz (p = 3)
3G workload
39
ALU utilization with frequency
3G workload
Relation between ALU utilization and power
minimization?
40
Choice of adders and multipliers
(?, fp)   | Optimal adders | Optimal multipliers | ALU/cluster power | Cluster/total power
(0.01, 1) | 2              | 1                   | 30                | 61
(0.01, 2) | 2              | 1                   | 30                | 61
(0.01, 3) | 3              | 1                   | 25                | 58
(0.1, 1)  | 2              | 1                   | 52                | 69
(0.1, 2)  | 2              | 1                   | 52                | 69
(0.1, 3)  | 3              | 1                   | 51                | 68
(1, 1)    | 1              | 1                   | 86                | 89
(1, 2)    | 2              | 2                   | 84                | 87
(1, 3)    | 2              | 2                   | 84                | 87
41
Exploration results
  • Final design conclusion:
  • Clusters: 64
  • Multipliers/cluster: 1
  • Multiplier utilization: 62%
  • Adders/cluster: 3
  • Adder utilization: 55%
  • Real-time frequency: 568.68 MHz for 128 Kbps/user
  • Exploration done in seconds

42
Outline
  • Background
  • Wireless systems
  • Stream processors
  • Mapping algorithms to stream processors
  • Power efficiency
  • Design exploration
  • Broad impact and future work

43
Broader impact
  • Results not specific to base-stations
  • High performance, low power system designs
  • Concepts can be extended to handsets
  • Mux network applicable to all SIMD processors
  • Power efficiency in scientific computing
  • Results 2 and 3 applicable to all stream applications
  • Design and power efficiency
  • Multimedia, MPEG,

44
Future work
  • Don't believe the model is the reality
  • Fabrication needed to verify concepts
  • Cycle accurate simulator
  • Extrapolating models for power
  • LDPC decoding (in progress)
  • Sparse matrix requires permutations over large
    data
  • Indexed SRF may help
  • 3G requires 1 GHz at 128 Kbps/user
  • 4G equalization at 1 Mbps breaks down (expected)

45
Options for higher performance
  • Multi-threading (ILP, MMX, DP, MT)
  • Schedule other kernels on unused clusters
  • Additional microcontroller and issue logic
    complexity
  • Pipelining (ILP, MMX, DP, MT, PP)
  • Standard way of improving performance
  • Inter-processor communication overhead
  • Load-balancing difficult
  • min(t1, t2, ...) instead of min(t1 + t2, ...)
  • Software tools need to catch up with hardware

46
Need for new architectures, definitions and
benchmarks
  • Road ends for conventional architectures [Agarwal 2000]
  • Wide range of architectures: DSP, ASSP, ASIP, reconfigurable, stream, ASIC, programmable
  • Difficult to compare and contrast
  • Need new definitions that allow comparisons
  • Wireless workloads
  • Typically ASIC designs
  • A SPEC-like benchmark is needed for programmable designs

47
Conclusions
  • Utilizing 100s-1000s of ALUs per clock cycle and mapping algorithms onto them is not easy in programmable architectures
  • Data parallel algorithms need to be designed and
    mapped
  • Power efficiency needs to be provided
  • Design exploration needed to decide ALUs to meet
    real-time constraints
  • My thesis lays the initial foundations