Title: Low-Energy Multi-User VSELP Vocoder for a Domain Specific Reconfigurable DSP Architecture
1. Low-Energy Multi-User VSELP Vocoder for a Domain Specific Reconfigurable DSP Architecture
Roy A. Sutton, University of California, Berkeley
CS 294-1, Spring 1998
2. Typical Digital Cellular Phone
[Figure: block diagram with blocks labeled ADC, Encode, Modulate, Duplex, Controller, Demodulate, Decode, DAC; the Encode label appears twice]
Use CODEC to compress the data stream!
3. Typical Digital Base Station
[Figure: block diagram with blocks labeled PBX, CH, R/F, ...]
Can we replace multiple CODECs with a single one and save space? Cost? Energy?
4. How to Save Energy?
- Eliminate idle time (Weiser et al., OSDI-94)
- Reduce energy per op (Chandrakasan et al., ISLPED-96)
- Work just fast enough and use the most efficient processing for the most repetitious work!
5. CODEC Data Stream Observations
- Users don't talk continuously (data is sporadic and bursty)
- During silence, very little processing is required
- No benefit from processing data faster than required
- Data arrives in packets, which need to be processed just fast enough for steady state
6. Which CODEC? Answer: VSELP
- Compression factor: 8x (64 kbps to 8 kbps)
- Frame size: 160 bits
- Output rate: 20 ms / frame (160 bits / 8 kbps)
- Divided into 4 sub-frames
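The figures on this slide follow from one another, and a short calculation makes the relationship explicit. This is only a sketch of the arithmetic; the constant names are illustrative, and the per-sub-frame values are derived here on the assumption that the four sub-frames are equal in size.

```python
# Figures taken from the slide: 64 kbps in, 8 kbps out, 160-bit frames, 4 sub-frames.
INPUT_BPS = 64_000
OUTPUT_BPS = 8_000
FRAME_BITS = 160
SUBFRAMES_PER_FRAME = 4

# Ratio of uncompressed to compressed bit rate.
compression_factor = INPUT_BPS // OUTPUT_BPS

# Time needed to emit one frame at the output rate, in milliseconds.
frame_period_ms = FRAME_BITS * 1000 / OUTPUT_BPS

# Assumption: the four sub-frames split the frame evenly.
subframe_bits = FRAME_BITS // SUBFRAMES_PER_FRAME
subframe_period_ms = frame_period_ms / SUBFRAMES_PER_FRAME

print(compression_factor, frame_period_ms, subframe_bits, subframe_period_ms)
```

Under the equal-split assumption each sub-frame covers 40 bits, which lines up with the queue-level marks (0, 40, 80, 120, 160) used later in the talk.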
7. What is the Repetitious Part?
8. VSELP Repetitious Part (cont.)
- 5% of code runs 70% of time and 92% of hits (ignoring theta)
- The 5% is found in two functions: 1) dot_product 2) iiRfilter
- Make sure these two computations are efficient!
9. HW Implementation Strategy
- Run the repetitious part (the 5%) on special hardware optimized for these two functions
- Run the remaining part (the 95%) on a general purpose processor
- Use the Pleiades architecture template as a guide to select an architecture instance
Pleiades
10. HW for dot_product and iiRfilter
- Can implement using address generators, memory elements, and MAC units
- dot_product
  - 2 Add Gen
  - 2 Memory
  - 1 MAC
- iiRfilter
  - 3 Add Gen
  - 3 Memory
  - 1 MAC
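To make the resource counts concrete, here is a minimal software sketch of the two kernels written around a single multiply-accumulate step. The slides do not give the filter equation, so the all-pole form used for `iir_filter` is an assumption, as are the function names and signatures. Each indexed sequence stands in for one memory with its own address generator: two for the dot product (both operands), three for the filter (coefficients, input, output history).

```python
from typing import List, Sequence


def mac(acc: float, a: float, b: float) -> float:
    """One multiply-accumulate step, the job of a MAC unit."""
    return acc + a * b


def dot_product(x: Sequence[float], y: Sequence[float]) -> float:
    """Two linearly stepped address streams (x and y) feeding one MAC."""
    if len(x) != len(y):
        raise ValueError("operands must have the same length")
    acc = 0.0
    for i in range(len(x)):
        acc = mac(acc, x[i], y[i])
    return acc


def iir_filter(a: Sequence[float], x: Sequence[float]) -> List[float]:
    """Assumed all-pole form: y[n] = x[n] - sum_{k=1..p} a[k-1] * y[n-k].

    Three address streams: coefficients a, input x, output history y.
    Output samples before the start of the signal are taken as zero.
    """
    y: List[float] = []
    for n in range(len(x)):
        acc = float(x[n])
        for k in range(1, len(a) + 1):
            if n - k >= 0:
                acc = mac(acc, -a[k - 1], y[n - k])
        y.append(acc)
    return y


print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
print(iir_filter([-0.5], [1.0, 0.0, 0.0]))
```

Both inner loops reduce to the same `mac` call with different addressing patterns, which is why one MAC unit plus a few address generators and memories can serve either kernel.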
11. Architecture Instance
[Figure: block diagram of a CPU plus 5 Memory, 5 Add Gen, and 2 MAC satellites connected through an Interconnect Network]
1 CPU, 12 Satellites, Networked: A Domain Specific Reconfigurable DSP Architecture!
12. HW Simplifying Assumptions
- Satellite and network configuration time is ignored
- Network is configured once at startup (statically)
- Satellites may be run-time configured (dynamically)
- Hardware performance tracks a 12-gate ring oscillator over voltage (1 - 3 volts)
- Satellites can be accessed by one thread at a time
- The network consumes no energy!
13. Processing 1 Stream
- Stream processed by a single thread
- Monitor the stream input buffer level
- Adjust the task priority and hardware throughput as required
- Thread uses Satellite processors for repetitive code
[Figure: stream queues (Q) and a scheduler, with signals labeled bl1, tp, p1]
14. Processing 4 Streams
[Figure: stream queues (Q) sharing satellite hardware blocks (hw)]
Threads compete for Satellites. Priority now important.
15. Thread Scheduling
- Exists
  - Use preemptive multithreaded scheduling
  - Dispatch highest priority thread for next time slice
- Extensions
  - Dynamically adjust each thread priority based on its workload
  - Dynamically adjust total hardware throughput based on aggregate workload
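The dispatch rule and the first extension can be sketched as a small time-slice simulation. Everything here is hypothetical scaffolding rather than the scheduler used in the talk: the `StreamThread` class, the choice of backlog as the workload measure, and the rule that priority simply equals backlog are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamThread:
    name: str
    backlog: int = 0   # pending work units (hypothetical workload measure)
    priority: int = 0


def update_priorities(threads: List[StreamThread]) -> None:
    """Assumed rule: a thread's priority equals its current backlog."""
    for t in threads:
        t.priority = t.backlog


def dispatch(threads: List[StreamThread]) -> Optional[StreamThread]:
    """Pick the highest-priority thread that has work; None if all are idle.

    Ties go to the thread that appears first in the list.
    """
    ready = [t for t in threads if t.backlog > 0]
    if not ready:
        return None
    return max(ready, key=lambda t: t.priority)


def run(threads: List[StreamThread], slices: int, work_per_slice: int = 1) -> List[Optional[str]]:
    """Re-evaluate priorities and dispatch once per time slice."""
    trace: List[Optional[str]] = []
    for _ in range(slices):
        update_priorities(threads)
        chosen = dispatch(threads)
        if chosen is None:
            trace.append(None)  # idle slice
            continue
        chosen.backlog = max(0, chosen.backlog - work_per_slice)
        trace.append(chosen.name)
    return trace


threads = [StreamThread("a", backlog=3), StreamThread("b", backlog=1)]
print(run(threads, slices=5))
```

Because priorities are recomputed before every slice, a thread whose backlog grows can preempt the one that ran last, which is the behavior the "Extensions" bullets describe.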
16. Priority and Throughput Adaptation
- 4 Performance Levels (TP)
  - For given fsample required, pick voltage via LUT
- Sub-frames mark queue levels (0, 40, 80, 120, 160)
- Adjust processor throughput and task priority by viewing queue levels
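One way to read these bullets is as a table lookup keyed on how full a stream's input queue is. The sketch below assumes that a fuller queue maps to a higher performance level and a higher task priority; the slides do not state the direction of the mapping. The LUT contents are placeholders chosen inside the 1 to 3 volt range mentioned earlier and are not measured values.

```python
import bisect
from typing import Tuple

# Queue-level marks from the slide, in bits (one mark per sub-frame boundary).
QUEUE_MARKS = (0, 40, 80, 120, 160)

# HYPOTHETICAL lookup table: performance level -> (relative throughput, supply volts).
# Placeholder numbers for illustration only; not taken from the talk.
PERF_LUT = {
    0: (0.25, 1.0),
    1: (0.50, 1.5),
    2: (0.75, 2.0),
    3: (1.00, 3.0),
}


def perf_level(queue_fill: int) -> int:
    """Map queue occupancy in bits to one of 4 performance levels (0..3)."""
    if queue_fill < 0:
        raise ValueError("queue fill cannot be negative")
    level = bisect.bisect_right(QUEUE_MARKS, queue_fill) - 1
    return min(level, max(PERF_LUT))  # a full or overfull queue uses the top level


def adapt(queue_fill: int) -> Tuple[int, float, float]:
    """Return (task priority, relative throughput, supply voltage).

    Assumption: task priority is set equal to the performance level.
    """
    level = perf_level(queue_fill)
    throughput, volts = PERF_LUT[level]
    return level, throughput, volts


for fill in (0, 40, 100, 160):
    print(fill, adapt(fill))
```

Keeping the voltage in a table rather than computing it mirrors the "pick voltage via LUT" bullet: the scheduler only needs the queue level, and the table hides the hardware's voltage-to-speed curve.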
17. Satellite Access Simulation Trace
18. Results
19. Conclusions
- A sporadic data stream with fixed throughput can be computed with reduced energy by stretching computation in time
- Using specialized processors for repetitive computation can reduce time and energy
- Multiple sporadic data streams can be viewed as a single stream with aggregate duty, at slight overhead
- Always compute using the maximum time allowable (and adjust processor throughput) to minimize energy
20. Future Work
- Account for interconnect network energy consumption
- Investigate critical instance adaptation behavior and adaptation transients
- Account for satellite and network configuration time
  - Hide configuration time of satellites by setting up the next operation during the current one
- Consider different satellite selection / configuration
- When do energy reduction returns diminish?
- What about sensitivity to thread time slice?