Adaptive System on a Chip (ASOC): A Backbone for Power-Aware Signal Processing Cores


1
Adaptive System on a Chip (ASOC): A Backbone for
Power-Aware Signal Processing Cores
  • Andrew Laffely, Jian Liang, Russ Tessier and
    Wayne Burleson
  • Electrical and Computer Engineering
  • University of Massachusetts Amherst
  • burleson_at_ecs.umass.edu

This material is based upon work supported by the
National Science Foundation under Grant No.
9988238 and SRC Tasks 766 and 1075
2
Challenges in Media Processing
  • Increasingly complex, heterogeneous algorithms
  • Variable run-times (e.g. data-dependent
    iterations)
  • Variable quality
  • Variable power consumption
  • Large data-sets, usually streaming
  • Memory size, ports and latency issues
  • Advancing semiconductor technology (Moore's Law)
  • Interconnect (on-chip and I/O)
  • Clocking
  • Power (consumption and distribution)
  • Design and Verification

3
aSoC: adaptive System on a Chip
  • Tiled SoC architecture

4
aSoC: adaptive System on a Chip
  • Tiled SoC architecture
  • Supports the use of independently developed
    heterogeneous cores
  • Pick and place cores which best perform the given
    application
  • Increase performance
  • Save power
  • Cores may be any number of tiles in size

5
aSoC: adaptive System on a Chip
  • Tiled SoC architecture
  • Supports the use of independently developed
    heterogeneous cores
  • Connected with an interconnect mesh
  • Restricted to near neighbor communications
  • Creates pipeline
  • Decreases cycle time

6
aSoC: adaptive System on a Chip
  • Tiled SoC architecture
  • Supports the use of independently developed
    heterogeneous cores
  • Connected with an optimized fixed interconnect
    mesh
  • Using a communication interface (CI) to manage
    data
  • Network port (Coreport) for each core, I/O
    queues, handshake
  • Each CI uses a memory and FSM to repetitively
    process a predefined (static) schedule of
    communications
  • High-speed 5x5 bidirectional crossbar
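
The statically scheduled communication interface described above can be pictured with a small Python sketch. It is a minimal illustration, not the aSoC hardware: the class name, the instruction encoding (a dict mapping each source port to a list of destination ports) and the example schedule are assumptions made for illustration. Only the ideas of a program counter stepping repetitively through an instruction memory, and of a 5x5 crossbar over the North, South, East, West and core ports, come from the slides.

```python
# Hypothetical model of a communication interface (CI): a program counter
# walks a fixed instruction memory and each instruction configures the
# crossbar for one cycle. The instruction encoding is an assumption.
PORTS = ("north", "south", "east", "west", "core")  # 5x5 crossbar ports


class CommunicationInterface:
    def __init__(self, schedule):
        # schedule: list of {source_port: [destination_ports]} dicts
        for instruction in schedule:
            for src, dsts in instruction.items():
                assert src in PORTS and all(d in PORTS for d in dsts)
        self.schedule = schedule
        self.pc = 0  # program counter into the instruction memory

    def step(self, inputs):
        """Route one cycle of input words and advance the schedule."""
        instruction = self.schedule[self.pc]
        outputs = {}
        for src, dsts in instruction.items():
            if src in inputs:  # nothing is routed if no word arrived
                for dst in dsts:
                    outputs[dst] = inputs[src]
        # The schedule is static and repeats forever.
        self.pc = (self.pc + 1) % len(self.schedule)
        return outputs


# Two-instruction example schedule (made up for illustration).
ci = CommunicationInterface([
    {"north": ["south", "east"]},  # cycle 0: north to south and east
    {"west": ["core"]},            # cycle 1: west into the local core
])
cycle0 = ci.step({"north": 0xA})
cycle1 = ci.step({"west": 0xB})
cycle2 = ci.step({"north": 0xC})  # the schedule has wrapped around
```

Because the schedule is fixed ahead of time, the model needs no routing decisions at run time; the third call simply reuses the first instruction.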

7
Communication Interface
Core
  • Custom design to maximize speed and reduce power
  • Core-ports
  • Crossbar
  • Controller
  • Instruction memory
  • Local frequency and voltage supply

[Figure: communication interface block diagram. Labels: core-ports; North, South, East and West inputs and outputs; local configuration; local frequency and voltage; decoder; controller; crossbar; "North to South East"; PC; instruction memory]
8
aSoC Implementation and Integration
0.18 µm TSMC technology, full custom
[Figure: layout with dimensions of 2500 λ and 3000 λ]
9
Research Thrusts
  • aSoC Infrastructure [1,3]
  • Communication Interface
  • Interconnect [3]
  • Power Distribution
  • Clock System
  • Power Management
  • Design Technology
  • Compiler [1,3] (Partitioner, Mapper, Placer,
    Scheduler)
  • Simulator [1]
  • Cores
  • Motion estimation [2,3]
  • Discrete Cosine Transform [2,3]
  • AES Cryptography [3]
  • Huffman Coding
  • Adaptive Viterbi [2,3]
  • 3D Graphics [1,2,3]
  • Smart Card [2,3]
  • MP3
  • ARM
  • DSP
  • Cache [2,3]
  • FPGA
  • MAC

[1] PhD dissertation, [2] Master's thesis, [3] publications
10
Voltage Scaling Approach
  • Core-ports
  • Single buffer for each stream to cross
    clock/voltage barrier between core and interface
  • Reading/Writing success rates indicate core
    utilization
  • Input blocked: core too slow
  • Output blocked: core too fast
  • Controller
  • Interprets core-port success rates to adjust
    local clock and voltage
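
The control idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the controller from the slides: the function name, the 0.2 threshold and the one-level-at-a-time policy are hypothetical, and the speed levels are borrowed from the clock multiples listed on the Operating Frequency slide.

```python
# Hypothetical clock and supply controller. Thresholds and the stepping
# policy are assumptions; only the blocked-port interpretation is from
# the slide (input blocked: core too slow, output blocked: core too fast).
SPEED_LEVELS = [0.25, 0.5, 1.0, 2.0, 4.0]  # multiples of interconnect clock


def adjust_speed(level, input_blocked_rate, output_blocked_rate,
                 threshold=0.2):
    """Return the next index into SPEED_LEVELS.

    input_blocked_rate:  fraction of transfers into the input core-port
                         that were blocked (core is too slow).
    output_blocked_rate: fraction of transfers into the output core-port
                         that were blocked (core is too fast).
    """
    if input_blocked_rate > threshold and level < len(SPEED_LEVELS) - 1:
        return level + 1  # core cannot keep up: raise clock and voltage
    if output_blocked_rate > threshold and level > 0:
        return level - 1  # core runs ahead: lower clock and voltage
    return level


level = 2  # start at 1x
level = adjust_speed(level, input_blocked_rate=0.5, output_blocked_rate=0.0)
faster = SPEED_LEVELS[level]
level = adjust_speed(level, input_blocked_rate=0.0, output_blocked_rate=0.6)
level = adjust_speed(level, input_blocked_rate=0.0, output_blocked_rate=0.6)
slower = SPEED_LEVELS[level]
```

Each level would be paired with a supply voltage chosen from the delay curve, so a lower level saves power through both frequency and voltage.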

[Figure: core block diagram. Labels: processing pipeline, local Vdd, local clock, buffer, input core-port, output core-port, clock and supply controller, blocked signals, interconnect]
11
Vdd Selection Criteria
[Figure: normalized core critical path delay vs. Vdd; y-axis: normalized delay]
  • As Vdd decreases, delay increases exponentially
  • Use curve to match available clock frequencies to
    voltages
  • The voltage and frequency change reduces power by
    79%, 96%, and 98.7% at 1/2, 1/4, and 1/8 speed, respectively
  • $P = \alpha C V_{dd}^2 f$
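
The quoted savings can be checked against the dynamic power relation with a short calculation. This is a sketch under assumptions: the nominal supply of 1.8 V is not stated on the slide (it is a common value for 0.18 µm processes), and pairing the plotted voltages 1.16 and 0.73 with half and quarter speed is an inference from the plot labels.

```python
# Dynamic power scales as P = a * C * Vdd**2 * f, so the ratio to the
# full-speed operating point depends only on the voltage and frequency ratios.
NOMINAL_VDD = 1.8  # volts; ASSUMPTION, not stated on the slide


def power_reduction(vdd, speed_fraction, nominal_vdd=NOMINAL_VDD):
    """Fractional dynamic power saving relative to full speed at nominal Vdd."""
    return 1.0 - (vdd / nominal_vdd) ** 2 * speed_fraction


half = power_reduction(1.16, 1 / 2)     # roughly 0.79
quarter = power_reduction(0.73, 1 / 4)  # roughly 0.96
```

Under these assumptions the first two results agree with the 79% and 96% figures; the voltage for the 1/8 speed point does not appear in the transcript, so the third figure is not recomputed here.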

[Plot annotations: normalized delay axis from 0 to 12; voltage axis from 0.4 to 2; operating points Max Speed, 1/2 Speed, 1/4 Speed and 1/8 Speed; voltages 0.73 and 1.16 marked]
12
Architecture Evaluation (Motion Estimation)
  • Array-based architecture
  • Pipelined ME
  • Parameterized search window size
  • Full search
  • Choose 16x16 or 8x8 windows
  • Reduce power

[Figure: motion estimation architecture. Blocks: memory, FIFOs, address generation unit, processing element array]
13
Power Aware Core
  • Custom motion estimation core
  • Choose search method
  • Full search
  • 960-600 mW (depending on bit width and pel sub-sampling)
  • Spiral search
  • 76 mW
  • Three-step search
  • 25 mW
  • Data taken with Synopsys(TM) Power Compiler at the
    RTL level
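
To show why the choice of search method changes the amount of work so much, here is a minimal Python sketch of textbook full search and three-step search block matching. It is not the core's implementation: the sum of absolute differences (SAD) cost, the function names, the search range of 7 and the synthetic test frames are assumptions made for illustration, and spiral search is left out.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at
    (bx, by) and the block of ref displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def in_bounds(ref, bx, by, dx, dy, n):
    return (0 <= by + dy and by + dy + n <= len(ref)
            and 0 <= bx + dx and bx + dx + n <= len(ref[0]))


def full_search(cur, ref, bx, by, n, r):
    """Test every displacement in [-r, r] x [-r, r]."""
    best, evals = None, 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if in_bounds(ref, bx, by, dx, dy, n):
                evals += 1
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return (best[1], best[2]), evals


def three_step_search(cur, ref, bx, by, n, step=4):
    """Test the 8 neighbours of the current centre, move to the best
    candidate, then halve the step (4, 2, 1)."""
    cx, cy, evals = 0, 0, 1
    best = sad(cur, ref, bx, by, 0, 0, n)
    while step >= 1:
        best_dx, best_dy = cx, cy
        for dy in (cy - step, cy, cy + step):
            for dx in (cx - step, cx, cx + step):
                if (dx, dy) == (cx, cy):
                    continue  # centre was already evaluated
                if not in_bounds(ref, bx, by, dx, dy, n):
                    continue
                evals += 1
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if cost < best:
                    best, best_dx, best_dy = cost, dx, dy
        cx, cy = best_dx, best_dy
        step //= 2
    return (cx, cy), evals


# Synthetic frames: the current frame is the reference displaced by (3, -2).
W = H = 32
ref = [[10 * x + 1000 * y for x in range(W)] for y in range(H)]
cur = [[10 * (x + 3) + 1000 * (y - 2) for x in range(W)] for y in range(H)]
full_mv, full_evals = full_search(cur, ref, 12, 12, 8, 7)
tss_mv, tss_evals = three_step_search(cur, ref, 12, 12, 8)
```

For this block, full search evaluates 225 candidate positions and three-step search evaluates 25. Three-step search is a heuristic and may settle on a different, higher-cost vector than full search, which is the quality side of the trade-off.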

14
aSoC Support
  • Multiple streams in and out through dedicated
    core ports
  • Easy to manage on both sides of the port
  • Schedule configuration streams in with the data
  • Stream A: input frame
  • Stream B: configuration (choose search mode and
    size)
  • Stream C: motion vectors
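
The stream arrangement above can be modelled with three queues. The following Python sketch is hypothetical throughout: the class name, the mapping of streams to the ports in1, in2 and out1, the configuration word format and the default mode are assumptions, and the motion vector is a placeholder rather than a computed result.

```python
from collections import deque


class MotionEstimationCoreModel:
    """Toy model of a core fed by dedicated core-port queues."""

    def __init__(self):
        self.in1 = deque()   # Stream A: input frame blocks (assumed port)
        self.in2 = deque()   # Stream B: configuration words (assumed port)
        self.out1 = deque()  # Stream C: motion vectors (assumed port)
        self.mode, self.window = "full", 16  # assumed defaults

    def tick(self):
        # Configuration that has arrived takes effect before the next block.
        while self.in2:
            self.mode, self.window = self.in2.popleft()
        if self.in1:
            block = self.in1.popleft()
            # Placeholder vector: a real core would run the selected search.
            self.out1.append({"block": block, "mode": self.mode,
                              "window": self.window, "mv": (0, 0)})


core = MotionEstimationCoreModel()
core.in1.append("block0")
core.tick()
core.in2.append(("three_step", 8))  # reconfigure between blocks
core.in1.append("block1")
core.tick()
results = list(core.out1)
```

Keeping configuration on its own stream means the data path never has to distinguish control words from pixels, which is one way to read "easy to manage on both sides of the port".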

[Figure: motion estimation core with core-ports in1, in2, out1 and out2; Streams A, B and C]
15
Reconfigurable Interconnect
  • P-frame
  • I-frame

[Figure: two pipeline configurations. P-frame: input frame, ME, MC, summation (difference), DCT. I-frame: input frame, DCT]
16
aSoC Support
[Figure: motion estimation and compensation core, DCT core]
  • Lumped ME, MC and Summation into one double core

17
aSoC Support: P-Frame
[Figure: motion estimation and compensation core, DCT core; input frame (Stream A), difference frame (Stream B)]
18
aSoC Support: Schedule Change
[Figure: motion estimation and compensation core, DCT core; input frame (Stream A), difference frame (Stream B), configuration streams (C and D)]
19
aSoC Support: Schedule Change
[Figure: motion estimation and compensation core, DCT core; input frame (Stream A), difference frame (Stream B); Schedule 1, PC, Schedule 2; configuration (Stream C)]
20
aSoC Support: Schedule Change
[Figure: motion estimation and compensation core, DCT core; input frame (Stream A), difference frame (Stream B); Schedule 1, PC, Schedule 2; configuration (Stream C)]
21
aSoC Support: Schedule Change
[Figure: motion estimation and compensation core, DCT core; input frame (Stream A); Schedule 1, PC, Schedule 2; configuration (Stream D)]
22
aSoC Support: Schedule Change
[Figure: motion estimation and compensation core, DCT core; input frame (Stream A); Schedule 1, PC, Schedule 2; configuration (Stream D)]
23
aSoC Support: I-Frame
[Figure: motion estimation and compensation core marked OFF, DCT core; input frame (Stream A)]
24
Operating Frequency?
  • Interconnect synchronized
  • H-tree clock distribution
  • Core frequencies depend on critical path
  • Tile provides clock reference
  • Coreport provides asynchronous boundary
  • Dynamic core configuration requires dynamic clock
    configuration
  • aSoC clock reference provides multiples of
    interconnect clock (4x, 2x, 1x, 0.5x, 0.25x, ...)
  • Configured through the tile controller
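
One way to picture dynamic clock configuration is a routine that picks the slowest available multiple that still meets a deadline. This Python sketch is an assumption-laden illustration: the selection policy, the function name, the 100 MHz interconnect clock, the 1 ms deadline and the cycle counts are all hypothetical; only the list of multiples comes from the slide.

```python
# Clock multiples of the interconnect clock listed on the slide.
CLOCK_MULTIPLES = [0.25, 0.5, 1.0, 2.0, 4.0]


def slowest_sufficient_multiple(core_cycles, deadline_s, interconnect_hz):
    """Return the smallest multiple that finishes core_cycles before the
    deadline, or None if even the fastest setting is too slow."""
    for multiple in CLOCK_MULTIPLES:  # ascending order
        if core_cycles / (multiple * interconnect_hz) <= deadline_s:
            return multiple
    return None


# Hypothetical numbers: 100 MHz interconnect clock, 1 ms deadline.
light = slowest_sufficient_multiple(40_000, 1e-3, 100e6)
heavy = slowest_sufficient_multiple(150_000, 1e-3, 100e6)
too_much = slowest_sufficient_multiple(500_000, 1e-3, 100e6)
```

With these numbers the light workload runs at 0.5x, the heavy one needs 2x, and the last one cannot meet the deadline at any listed multiple.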

25
Clock Distribution
[Figure label: Tile]
  • Tiled architecture extends life of globally
    synchronous systems
  • Precise H-tree implementation
  • Load is small and equal at each branch
  • Skew can be reduced by 70% with advanced deskew
    circuits [1]

[1] S. Tam et al., "Clock Generation and
Distribution for the First IA-64 Microprocessor,"
IEEE JSSC, Nov. 2000
26
Mixed vs. Fixed Core Frequencies
  • Cores not designed with clock gating
  • Core power from Synopsys RTL simulation
  • Interconnect from SPICE
  • Assumes 10-cycle schedule, 4 pixels/word

27
Current Density and Clocking
  • Red: fixed worst-case clocking
  • Short spikes of high current
  • Green: optimal independent clocking
  • Slow and low
  • Optimal clocking eliminates current spikes (and
    improves battery life)

[Figure: current vs. time from process start to deadline for ME full search, ME spiral search, ME three-step search and DCT]
28
Power Distribution
  • Heterogeneous power-aware cores require multiple
    power supply voltages
  • Tile structure enables uniform interwoven grid
  • Larger grid for higher current demands
  • Reduced resistance
  • Higher capacitance

[Figure: interwoven power grid with rails Gnd, Vml, Vl, Vmh and Vh]
29
Advanced Signaling Techniques (building on
SRC-funded work)
  • Differential current sensing
  • Booster insertion
  • Multi-level current signaling
  • Phase coding
30
Interconnect Characterization
Comparing delay and power of signaling techniques for different
tile sizes at 250 nm, 180 nm, 130 nm and 100 nm
(available via the web-based tool Network on Chip
Interconnect Calculator, NoCIC)
31
Conclusions
  • Regular Tiled Architecture
  • Task-based parallelism using heterogeneous cores
  • Predictable interconnect
  • Regular core interface, Vdd and clock control,
    and configuration control
  • Static scheduling
  • High-level global schedule of inter-core
    communication
  • Accommodates dynamic workloads with queues and
    local handshakes
  • Demonstration using Motion Estimation and DCT
  • Variable search window and search algorithm
    provide power/quality tradeoff
  • Power savings using scalable approaches to
    dynamic clock and power variation
  • Simple clock dividers leveraging existing clock
    distribution methods
  • Route multiple power supplies to allow rapid
    switching and avoid overhead of on-chip power
    regulation

32
Ongoing Work
  • Satellite Set-top Box application
  • Developed at Hughes Networks using 7 distinct
    RISC cores. Compare ASOC with in-house shared
    memory approach for interconnections.
  • New and more complete wireless and multimedia
    systems
  • JPEG 2000, MPEG-4, 3D graphics, ...
  • ASOC parameter optimization
  • Tile sizes, bus widths, clocks, VDDs
  • Coping with Core irregularity
  • Size, I/O positions, shapes, bus widths,
    communication interfaces
  • Interconnect circuit optimization (NoCIC)
  • Leakage Power issues
  • Reliability, Test, Fault-Tolerance and Security
  • Compilation, especially partitioning and mapping
  • Prototypes: 0.18 µm MOSIS prototype of the communication
    interface, 25K transistors, verification of
    interface logic and timing
  • ASOC in education: circuits, architecture and
    core design projects

33
Implications (perhaps controversial?)
  • Multi-core architectures will be needed to
    maintain Moore's Law (interconnect, memory,
    parallelism)
  • Task-based parallelism may be easier to program,
    extract and implement than data parallelism
    (think multi-core rather than instruction level
    parallelism)
  • Global coarse synchronization provides an
    approach to hard-real time computing for dynamic
    workloads (e.g., video coding).
  • Dynamic Power savings exploiting fine-grain
    workload variations can be achieved through
    straightforward clock and power scaling methods.
  • Interconnect standards will be specified by
    silicon foundries similar to cell libraries and
    memories

34
Design Flow
http://vsp2.ecs.umass.edu/vspg/658/TA_Tools/design_flow.html
  • Architecture to Layout
  • Architecture: block diagram of system and
    behavioral description
  • Logic: gate level or schematic description
  • Circuit: transistor configurations and sizings
  • Layout: floorplanning, clock and power
    distribution
  • Tools
  • Verilog-XL: behavioral representation
  • VTVT: standard cell library
  • Synopsys: standard cell gate level netlist
    generation
  • Silicon Ensemble: standard cell netlist to layout
  • Cadence LayoutPlus: schematic and layout design
  • NCSU CDK: design and extraction rules
  • Cadence Layout vs. Schematic: layout verification
  • HSPICE: circuit simulator