Adaptive System on a Chip (ASOC): A Backbone for Power-Aware Signal Processing Cores

About This Presentation


Slides: 35

Provided by: burl8


Transcript and Presenter's Notes

1
Adaptive System on a Chip (ASOC) A Backbone for
Power-Aware Signal Processing Cores

Andrew Laffely, Jian Liang, Russ Tessier and
Wayne Burleson
Electrical and Computer Engineering
University of Massachusetts Amherst
burleson_at_ecs.umass.edu

This material is based upon work supported by the
National Science Foundation under Grant No.
9988238 and SRC Tasks 766 and 1075
2
Challenges in Media Processing

Increasingly complex, heterogeneous algorithms
Variable run-times (e.g. data-dependent
iterations)
Variable quality
Variable power consumption
Large data-sets, usually streaming
Memory size, ports and latency issues
Advancing semiconductor technology (Moore's Law)
Interconnect (on-chip and I/O)
Clocking
Power (consumption and distribution)
Design and Verification

3
aSoC: adaptive System on a Chip

Tiled SoC architecture

4
aSoC: adaptive System on a Chip

Tiled SoC architecture
Supports the use of independently developed
heterogeneous cores
Pick and place cores which best perform the given
application
Increase performance
Save power
Cores may be any number of tiles in size

5
aSoC: adaptive System on a Chip

Tiled SoC architecture
Supports the use of independently developed
heterogeneous cores
Connected with an interconnect mesh
Restricted to near neighbor communications
Creates pipeline
Decreases cycle time

6
aSoC: adaptive System on a Chip

Tiled SoC architecture
Supports the use of independently developed
heterogeneous cores
Connected with an optimized fixed interconnect
mesh
Using a communication interface (CI) to manage
data
Network port (Coreport) for each core, I/O
queues, handshake
Each CI uses a memory and FSM to repetitively
process a predefined (static) schedule of
communications
High-speed 5x5 bidirectional crossbar
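
The statically scheduled interface described on this slide can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the actual circuit: the class name, the port names, and the dictionary encoding of a crossbar setting are hypothetical, and reading "5x5" as four mesh directions plus the local core is an assumption. The model keeps a schedule memory and a program counter that wraps, so the same communication pattern repeats indefinitely.

```python
from collections import deque

# Assumed port names: four mesh directions plus the local core (one reading of "5x5").
PORTS = ("north", "south", "east", "west", "core")


class CommunicationInterface:
    """Toy model of one tile's interface: schedule memory stepped by a program counter."""

    def __init__(self, schedule):
        # schedule: one crossbar setting per cycle, each a {source_port: destination_port} dict.
        self.schedule = schedule
        self.pc = 0
        self.inputs = {port: deque() for port in PORTS}
        self.outputs = {port: deque() for port in PORTS}

    def step(self):
        """Apply the current crossbar setting, then advance the PC, wrapping at the end."""
        for src, dst in self.schedule[self.pc].items():
            if self.inputs[src]:  # move a word only if one is waiting
                self.outputs[dst].append(self.inputs[src].popleft())
        self.pc = (self.pc + 1) % len(self.schedule)


# Two-cycle schedule: cycle 0 delivers west -> core, cycle 1 forwards core -> east.
ci = CommunicationInterface([{"west": "core"}, {"core": "east"}])
ci.inputs["west"].append("pixel0")
ci.step()
delivered = list(ci.outputs["core"])
ci.step()
pc_after_wrap = ci.pc
```

Because the schedule is fixed ahead of time, no routing decision is made at run time; each cycle only looks up the next crossbar setting.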

7
Communication Interface
Core

Custom design to maximize speed and reduce power
Core-ports
Crossbar
Controller
Instruction memory
Local frequency and voltage supply

[Figure: communication interface block diagram. Labels: core-ports; North, South, East, and West inputs and outputs; local config.; local frequency & voltage; decoder; controller; crossbar (example route: North to South & East); PC; instruction memory]
8
aSoC Implementation and Integration
0.18 µm TSMC technology, full custom
[Figure: layout, with dimensions labeled 2500 λ and 3000 λ]
9
Research Thrusts

aSoC Infrastructure [1,3]
Communication Interface
Interconnect [3]
Power Distribution
Clock System
Power Management
Design Technology
Compiler [1,3] (Partitioner, Mapper, Placer,
Scheduler)
Simulator [1]

Cores
Motion estimation [2,3]
Discrete Cosine Transform [2,3]
AES Cryptography [3]
Huffman Coding
Adaptive Viterbi [2,3]
3D Graphics [1,2,3]
Smart Card [2,3]
MP3
ARM
DSP
Cache [2,3]
FPGA
MAC

[1] PhD dissertation, [2] Master's thesis, [3]
publications
10
Voltage Scaling Approach

Core-ports
Single buffer for each stream to cross
clock/voltage barrier between core and interface
Reading/Writing success rates indicate core
utilization
Input blocked: core too slow
Output blocked: core too fast
Controller
Interprets core-port success rates to adjust
local clock and voltage
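
A minimal sketch of the controller policy on this slide follows. It assumes the controller steps through a small table of speed levels; the function name, the blocked-rate inputs, and the 0.2 threshold are hypothetical and chosen only for illustration. The rule itself comes from the slide: a blocked input core-port means the core is too slow, a blocked output core-port means it is too fast.

```python
# Speed levels as fractions of the maximum clock (max, 1/2, 1/4, 1/8 speed).
SPEED_LEVELS = [0.125, 0.25, 0.5, 1.0]


def adjust_level(level, input_blocked_rate, output_blocked_rate, threshold=0.2):
    """Return the next index into SPEED_LEVELS.

    input_blocked_rate:  fraction of recent cycles in which the input core-port was blocked
    output_blocked_rate: fraction of recent cycles in which the output core-port was blocked
    threshold:           hypothetical trigger level, not taken from the slides
    """
    if input_blocked_rate > threshold and level < len(SPEED_LEVELS) - 1:
        return level + 1  # core too slow: raise clock and voltage
    if output_blocked_rate > threshold and level > 0:
        return level - 1  # core too fast: lower clock and voltage
    return level


speed_up = adjust_level(1, input_blocked_rate=0.5, output_blocked_rate=0.0)
slow_down = adjust_level(1, input_blocked_rate=0.0, output_blocked_rate=0.5)
hold = adjust_level(2, input_blocked_rate=0.1, output_blocked_rate=0.1)
```

A real controller would probably add hysteresis so that the core does not oscillate between two levels; that is left out here to keep the rule visible.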

[Figure: core processing pipeline with local Vdd and local clock, placed between an input core-port and an output core-port (each with a buffer and a "blocked" signal); a clock and supply controller; the interconnect]
11
Vdd Selection Criteria
Normalized Core Critical Path Delay vs. Vdd
12
Normalized Delay

As Vdd decreases, delay increases exponentially
Use curve to match available clock frequencies to
voltages
The voltage and frequency change reduces power by
79%, 96%, and 98.7%
$P = \alpha C V_{dd}^{2} f$
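
The first two savings quoted above can be checked against the dynamic-power relation with a short calculation. Two assumptions are made here that the slide does not state: the nominal supply is taken as 1.8 V, and the two voltages labeled on the plot, 1.16 V and 0.73 V, are paired with 1/2 and 1/4 speed respectively. The activity factor and capacitance cancel in the ratio, so neither needs a value.

```python
def power_reduction(vdd, speed_fraction, vdd_nominal=1.8):
    """Fractional dynamic-power saving relative to (vdd_nominal, full speed).

    Uses P proportional to Vdd^2 * f; the activity factor and C cancel in the ratio.
    vdd_nominal=1.8 is an assumption, not a value given on the slide.
    """
    ratio = (vdd / vdd_nominal) ** 2 * speed_fraction
    return 1.0 - ratio


half_speed_saving = power_reduction(1.16, 0.5)      # roughly 0.79
quarter_speed_saving = power_reduction(0.73, 0.25)  # roughly 0.96
```

Under those assumptions the results land on the quoted 79% and 96%. The voltage for 1/8 speed is not given in the transcript, so the 98.7% figure is not reproduced here.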

[Figure: normalized delay (0 to 10) vs. voltage (0.4 to 2 V), with operating points marked at max, 1/2, 1/4, and 1/8 speed; 0.73 and 1.16 are called out on the voltage axis]
12
Architecture Evaluation (Motion Estimation)
Memory

Array-based architecture
Pipelined ME
Parameterized search window size
Full search
Choose 16x16 or 8x8 windows
Reduce power
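
For readers unfamiliar with full-search motion estimation, here is a software sketch of the computation that such an array performs. It is not the core's design: the function names and the list-of-rows frame layout are hypothetical, and the block size `n` and `search_range` are stand-ins for the parameterized window choices on the slide. Every candidate displacement is scored with a sum of absolute differences (SAD) and the lowest score wins.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of cur at (bx, by) and the block of ref displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Score every displacement within +/- search_range that keeps the candidate inside ref."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + n <= height
                      and 0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best  # (lowest SAD, dx, dy)


# Synthetic 8 x 8 frames: cur is ref shifted by 2 pixels in x and 1 pixel in y.
ref = [[y * 8 + x for x in range(8)] for y in range(8)]
cur = [[(y + 1) * 8 + (x + 2) for x in range(8)] for y in range(8)]
result = full_search(cur, ref, bx=2, by=2, n=2, search_range=2)  # (0, 2, 1)
```

The cost grows with the square of the search range and the square of the block size, which is why shrinking the window is listed as a way to reduce power.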

[Figure: motion estimation architecture with FIFOs, an address generation unit, and a processing element array]
13
Power Aware Core

Custom motion estimation core
Choose search method
Full search
960 to 600 mW (depending on bit width and pel
sub-sampling)
Spiral search
76 mW
Three step search
25 mW
Data taken with Synopsys Power Compiler at the
RTL level
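
The large gap between full search and three step search comes from how few candidates the latter evaluates. The sketch below shows the textbook three-step search, which may differ from what the custom core implements; the function names and the frame layout are hypothetical. It scores the centre and its eight neighbours at step 4, recentres on the best, then repeats at steps 2 and 1. It is a heuristic and can miss the true minimum on irregular images.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD between the n x n block of cur at (bx, by) and the block of ref displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def three_step_search(cur, ref, bx, by, n, first_step=4):
    """Return (dx, dy, cost, candidates_visited) for the classic three-step heuristic."""
    height, width = len(ref), len(ref[0])

    def cost(dx, dy):
        inside = (0 <= by + dy and by + dy + n <= height
                  and 0 <= bx + dx and bx + dx + n <= width)
        return sad(cur, ref, bx, by, dx, dy, n) if inside else float("inf")

    best_dx, best_dy, best_cost = 0, 0, cost(0, 0)
    visited = 1
    step = first_step
    while step >= 1:
        centre_dx, centre_dy = best_dx, best_dy
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                if ox == 0 and oy == 0:
                    continue  # centre was already scored
                c = cost(centre_dx + ox, centre_dy + oy)
                visited += 1
                if c < best_cost:
                    best_dx, best_dy, best_cost = centre_dx + ox, centre_dy + oy, c
        step //= 2
    return best_dx, best_dy, best_cost, visited


# Synthetic 32 x 32 frames: cur is ref shifted by 4 pixels in x.
ref = [[y * 32 + x for x in range(32)] for y in range(32)]
cur = [[y * 32 + x + 4 for x in range(32)] for y in range(32)]
result = three_step_search(cur, ref, bx=12, by=12, n=4)  # (4, 0, 0, 25)
```

With steps of 4, 2, and 1 the search visits 25 candidates and covers displacements up to 7 pixels, where an exhaustive search over the same range would visit 225.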

14
aSoC Support

Multiple streams in and out through dedicated
core ports
Easy to manage on both sides of the port
Schedule configuration streams in with the data
Stream A: input frame
Stream B: configuration (choose search mode and
size)
Stream C: motion vectors

[Figure: motion estimation core with coreports in1, in2, out1, and out2 carrying Streams A, B, and C]
15
Reconfigurable Interconnect

[Figure: P-frame and I-frame datapaths. P-frame labels: input frame, ME, MC, Σ (subtraction), DCT. I-frame labels: input frame, DCT]
16
aSoC Support
[Figure: Motion Estimation & Compensation core and DCT core]

Lumped ME, MC and Summation into one double core

17
aSoC Support: P-Frame
[Figure: Motion Estimation & Compensation core and DCT core; input frame (Stream A), difference frame (Stream B)]
18
aSoC Support: Schedule Change
[Figure: Motion Estimation & Compensation core and DCT core; input frame (Stream A), difference frame (Stream B), configuration streams (C & D)]
19
aSoC Support: Schedule Change
[Figure: Motion Estimation & Compensation core and DCT core; input frame (Stream A), difference frame (Stream B), configuration (Stream C); interface labels: Schedule 1, PC, Schedule 2]
20
aSoC Support: Schedule Change
[Figure: Motion Estimation & Compensation core and DCT core; input frame (Stream A), difference frame (Stream B), configuration (Stream C); interface labels: Schedule 1, PC, Schedule 2]
21
aSoC Support: Schedule Change
[Figure: Motion Estimation & Compensation core and DCT core; input frame (Stream A), configuration (Stream D); interface labels: Schedule 1, PC, Schedule 2]
22
aSoC Support: Schedule Change
[Figure: Motion Estimation & Compensation core and DCT core; input frame (Stream A), configuration (Stream D); interface labels: Schedule 1, PC, Schedule 2]
23
aSoC Support: I-Frame
[Figure: Motion Estimation & Compensation core (marked OFF) and DCT core; input frame (Stream A)]
24
Operating Frequency?

Interconnect synchronized
H-tree clock distribution
Core frequencies depend on critical path
Tile provides clock reference
Coreport provides asynchronous boundary
Dynamic core configuration requires dynamic clock
configuration
aSoC clock reference provides multiples of
interconnect clock (4x, 2x, 1x, 0.5x, 0.25x, ...)
Configured through the tile controller
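
One way to read the bullets above is that each core is given the fastest available clock multiple whose period still covers its critical path. The sketch below illustrates that selection only; the function name, the 400 MHz interconnect clock, and the 4 ns critical path are hypothetical values, and only the five multiples listed on the slide are considered.

```python
# Multiples of the interconnect clock listed on the slide, fastest first.
CLOCK_MULTIPLES = (4.0, 2.0, 1.0, 0.5, 0.25)


def pick_core_clock(interconnect_mhz, critical_path_ns):
    """Return (multiple, core_mhz) for the fastest multiple that meets timing, or None."""
    for multiple in CLOCK_MULTIPLES:
        core_mhz = interconnect_mhz * multiple
        period_ns = 1000.0 / core_mhz
        if period_ns >= critical_path_ns:  # clock period must cover the critical path
            return multiple, core_mhz
    return None  # even the slowest listed multiple is too fast for this core


# Hypothetical numbers: 400 MHz interconnect, core with a 4 ns critical path.
choice = pick_core_clock(400.0, 4.0)    # (0.5, 200.0)
too_slow = pick_core_clock(400.0, 100.0)
```

When a core is reconfigured (for example, to a different search method), its critical path or required throughput changes, and the tile controller would repeat this choice.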

25
Clock Distribution
Tile

Tiled architecture extends life of globally
synchronous systems
Precise H-tree implementation
Load is small and equal at each branch
Skew can be reduced by 70% with advanced deskew
circuits [1]

[1] S. Tam et al., "Clock Generation and
Distribution for the First IA-64 Microprocessor,"
IEEE JSSC, Nov. 2000
26
Mixed vs. Fixed Core Frequencies

Cores not designed with clock gating
Core power from Synopsys RTL simulation
Interconnect from SPICE
Assumes 10 cycle schedule, 4 pixels/word

27
Current Density and Clocking

Red: fixed worst-case clocking
Short spikes of high current
Green: optimal independent clocking
Slow and low
Optimal clocking eliminates current spikes (and
also improves battery life)

[Figure: current vs. time, from process start to deadline, for ME full search, ME spiral, ME three step search, and DCT]
28
Power Distribution

Heterogeneous power-aware cores require multiple
power supply voltages
Tile structure enables uniform interwoven grid
Larger grid for higher current demands
Reduced resistance
Higher capacitance

[Figure: interwoven power grid with rails labeled Gnd, Vml, Vl, Vmh, and Vh]
29
Advanced Signaling Techniques (building on
SRC-funded work)
Differential current sensing
Booster Insertion
Multi-level current signaling
Phase coding
30
Interconnect Characterization: comparing delay
and power of signaling techniques for different
tile sizes at 250 nm, 180 nm, 130 nm, and 100 nm
(available via web-based tool Network on Chip
Interconnect Calculator, NoCIC)
31
Conclusions

Regular Tiled Architecture
Task-based parallelism using heterogeneous cores
Predictable interconnect
Regular core interface, Vdd and clock control,
and configuration control
Static scheduling
High-level global schedule of inter-core
communication
Accommodates dynamic workloads with queues and
local handshakes
Demonstration using Motion Estimation and DCT
Variable search window and search algorithm
provide power/quality tradeoff
Power savings using scalable approaches to
dynamic clock and power variation
Simple clock dividers leveraging existing clock
distribution methods
Route multiple power supplies to allow rapid
switching and avoid overhead of on-chip power
regulation

32
Ongoing Work

Satellite Set-top Box application
Developed at Hughes Networks using 7 distinct
RISC cores. Compare ASOC with in-house shared
memory approach for interconnections.
New and more complete wireless and multimedia
systems
JPEG 2000, MPEG-4, 3D graphics, ...
ASOC parameter optimization
Tile sizes, bus widths, clocks, VDDs
Coping with Core irregularity
Size, I/O positions, shapes, bus widths,
communication interfaces
Interconnect circuit optimization (NoCIC)
Leakage Power issues
Reliability, Test, Fault-Tolerance and Security
Compilation especially Partitioning, Mapping
Prototypes: 0.18 µm MOSIS prototype of the
communication interface, 25K transistors,
verification of interface logic and timing
ASOC in education: circuits, architecture and
core design projects

33
Implications (perhaps controversial?)

Multi-core architectures will be needed to
maintain Moore's Law (interconnect, memory,
parallelism)
Task-based parallelism may be easier to program,
extract and implement than data parallelism
(think multi-core rather than instruction level
parallelism)
Global coarse synchronization provides an
approach to hard real-time computing for dynamic
workloads (e.g., video coding).
Dynamic Power savings exploiting fine-grain
workload variations can be achieved through
straightforward clock and power scaling methods.
Interconnect standards will be specified by
silicon foundries similar to cell libraries and
memories

34
Design Flow
http://vsp2.ecs.umass.edu/vspg/658/TA_Tools/design_flow.html

Architecture to Layout
Architecture: block diagram of system and
behavioral description
Logic: gate level or schematic description
Circuit: transistor configurations and sizings
Layout: floorplanning, clock and power
distribution
Tools
Verilog-XL: behavioral representation
VTVT: standard cell library
Synopsys: standard cell gate level netlist
generation
Silicon Ensemble: standard cell netlist to layout
Cadence LayoutPlus: schematic and layout design
NCSU CDK: design and extraction rules
Cadence Layout vs. Schematic: layout verification
HSPICE: circuit simulator
