Task 1: System Architecture and Circuit Innovations

Slides: 48
Provided by: scott384
Transcript and Presenter's Notes

1
Task 1: System Architecture and Circuit Innovations
  • Anantha Chandrakasan MIT
  • Bill Dally Stanford
  • Mark Horowitz Stanford
  • Kunle Olukotun Stanford
  • Scott Wills Georgia Tech

Interconnect Focus Center
2
Interconnect and Gate Delays
from ITRS99
3
Overcoming Interconnect Limitations
  • Improve Interconnect Performance
  • Improve Interconnect Utilization
  • Reduce Interconnect Demand

4
Improve Interconnect Performance
  • delay: maximize propagation velocity
  • repeaters
  • overdrive
  • controlled transmission waveforms
  • power efficient signaling
  • low swing
  • signal coding
  • alternative synchronization and clock
  • noise
  • noise characterization and cancellation
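
The case for repeaters can be illustrated with a first-order distributed-RC model. This is a rough sketch, not the presenters' analysis: the 0.38·r·c·L² expression is the usual approximation for the 50% delay of a distributed RC line, while the per-mm resistance, per-mm capacitance, and repeater delay below are placeholder values assumed purely for illustration.

```python
def wire_delay(length_mm, r_per_mm, c_per_mm):
    """50% delay of an unrepeated distributed RC wire (grows with length squared)."""
    return 0.38 * r_per_mm * c_per_mm * length_mm ** 2


def repeated_delay(length_mm, r_per_mm, c_per_mm, segments, t_repeater):
    """Delay when the wire is cut into equal segments, each driven by a repeater."""
    seg = length_mm / segments
    return segments * (wire_delay(seg, r_per_mm, c_per_mm) + t_repeater)


# Placeholder values (assumptions): 1 kOhm/mm, 0.2 pF/mm, 20 ps per repeater.
R, C, T_REP = 1e3, 0.2e-12, 20e-12

unrepeated = wire_delay(10, R, C)  # 10 mm wire, no repeaters
repeated = repeated_delay(10, R, C, segments=10, t_repeater=T_REP)
print(f"unrepeated: {unrepeated * 1e9:.2f} ns, repeated: {repeated * 1e9:.2f} ns")
```

Under these assumed numbers the repeated wire is several times faster, because segmenting turns the quadratic dependence on length into a linear one at the cost of one repeater delay per segment.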

Connect with Task 2
5
Driving Long Regular Wires (part of an on-chip network)
Uniform, well-characterized lines enable custom circuits: 0.1x power, 3x velocity
Long, lossy RC lines
Regenerative Repeaters
H-bridge driver 100mV swing
Connect with Task 2
6
Transition Pattern Coding
Input Data
Recovered Data
Data Transformations
Decoder
Extra Bits
Control (Coding Scheme Selection)
Possible memory
Bits
Drivers
Receivers
Extended Data Bus (includes extra control lines)
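
The slide does not say which coding schemes are selected. As one well-known example of trading extra bus lines for fewer transitions, the sketch below uses bus-invert coding, which adds a single control line. It is an assumption that this resembles the schemes the presenters had in mind; the function names are hypothetical.

```python
def bus_invert_encode(words, width):
    """Encode words so that at most width/2 data lines toggle per transfer.

    Returns a list of (bus_value, invert_bit) pairs. The bus is assumed
    to start at all zeros.
    """
    mask = (1 << width) - 1
    prev = 0
    out = []
    for w in words:
        toggles = bin((w ^ prev) & mask).count("1")
        if toggles > width // 2:
            bus, inv = (~w) & mask, 1  # send the complement, raise the extra line
        else:
            bus, inv = w & mask, 0
        out.append((bus, inv))
        prev = bus
    return out


def bus_invert_decode(pairs, width):
    """Recover the original words from (bus_value, invert_bit) pairs."""
    mask = (1 << width) - 1
    return [((~bus) & mask) if inv else bus for bus, inv in pairs]


def count_transitions(values, width):
    """Total number of data-line toggles, starting from an all-zero bus."""
    mask = (1 << width) - 1
    prev, total = 0, 0
    for v in values:
        total += bin((v ^ prev) & mask).count("1")
        prev = v
    return total


data = [0x00, 0xFF, 0x00, 0xFE]
encoded = bus_invert_encode(data, 8)
recovered = bus_invert_decode(encoded, 8)
```

The count above covers only the data lines; the extra invert line also toggles and would need to be included in a full power estimate.
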
7
Move Longest Global Wires Off-Chip
  • Move longest global interconnects off-chip
  • Improve interconnect quality and speed

Connect with Task 4
8
Improve Interconnect Performance
  • off-chip I/O
  • dense electrical I/O
  • optical I/O
  • RF I/O
  • clock distribution
  • distributed clocking
  • optical clock distribution

Connect with Task 3 and Task 4
9
Distributed Clocking (MIT)
  • Synchronized clock generated at multiple points

10
Test Chip Results (ISSCC 2000)
  • 16 oscillators (40 µm x 40 µm)
  • 24 phase detectors (20 µm x 40 µm)
  • 0.35 µm, single-poly, triple-metal CMOS
  • Total power 450mW at 3V, 1.3GHz

11
Optical Clocking
Local Electrical Distribution
Optical GCLK
Optical Receiver
waveguides
Connect with Task 2 and Task 3
12
Baseline Optical Receiver Fabrication
  • Test chip fabrication
  • 0.35 µm MOSIS
  • Simple Si diode detector
  • 4 receivers at corners of 2mm x 2mm chip
  • Test chip is functional (currently tested at
    300MHz)
  • Process and environmental variations had a big
    impact on clock skew
  • Future work will focus on reducing the impact
    of variations

13
Overcoming Interconnect Limitations
  • Improve Interconnect Performance
  • Improve Interconnect Utilization
  • Reduce Interconnect Demand

14
Increase Interconnect Utilization
  • Replace dedicated global wiring with a shared
    network

Dedicated wiring
Network
15
Most Wires are Idle Most of the Time
  • Don't dedicate wires to signals, share them
  • Route packets, not wires
  • Organize global wiring as an on-chip
    interconnection network
  • increases global wire utilization
  • allows more flexible use of global wiring
    resource
  • offers regular, highly optimized global wiring

16
Dedicated Wires versus On-Chip Network
17
Network Routers: Repeaters with Switching
  • Need repeaters every 1mm or less
  • Easy to insert switching
  • zero-cost reconfiguration
  • Minimize decision time
  • static routing
  • fixed or regular pattern
  • source routing
  • on-demand
  • requires arbitration and fanout
  • can be pipelined
  • Minimize buffering
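
Source routing, one of the options listed above for minimizing decision time, can be sketched as follows. This is an illustrative model only: the 2-D grid of routers, the port names, and the X-then-Y route choice are assumptions, not details given in the presentation.

```python
# Port name -> (dx, dy) step on a 2-D grid of routers (hypothetical layout).
STEPS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}


def make_route(src, dst):
    """Build the whole route at the source: all X hops first, then all Y hops."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    x_hops = ["E" if dx > 0 else "W"] * abs(dx)
    y_hops = ["N" if dy > 0 else "S"] * abs(dy)
    return x_hops + y_hops


def deliver(src, route):
    """Each router only reads the next port from the packet header."""
    x, y = src
    path = [(x, y)]
    for port in route:
        sx, sy = STEPS[port]
        x, y = x + sx, y + sy
        path.append((x, y))
    return path


route = make_route((0, 0), (2, 1))
path = deliver((0, 0), route)
```

Because the route is computed once at the source, each router's per-hop decision reduces to selecting the output port named in the header.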

[Figure: routers spaced 1mm apart, with arbitration (Arb) and lookup table (LUT) blocks]
18
Architecture Reduces Impact of Slow Wires; Circuits Make Wires More Efficient
  • Locality
  • Eliminate implicit global communication
  • Expose and optimize the communication
  • Clustered architecture
  • Partitioned register file
  • Data migration
  • Networking
  • Route packets, not wires
  • Improves duty factor of wires
  • Single, regular, highly-optimized design

19
Overcoming Interconnect Limitations
  • Improve Interconnect Performance
  • Improve Interconnect Utilization
  • Reduce Interconnect Demand

20
SIMD Instruction Broadcast
  • Uses long broadcasting wires
  • Long-wire delays limit system clock

Global instruction broadcast
Plot extracted from 1997 NTRS
21
Short-Wire Instruction Broadcast
  • Exclusive use of short-wire interconnects
  • Systolic-style instruction distribution
  • Reduced wiring demand
  • Some nearest neighbor communication issues

ACU
22
Making Broadcast Systolic
A communication instruction is composed of two mini-operations:
1- Read data from the source register file and put it in the inter-PE buffer
2- Read data from the inter-PE buffer and write it to the destination register file
xfer r2 r1 East: r1 to east port, west port to r2
xfer r2 r1 West: r1 to west port, NOP, east port to r2
[Figure: register files r1 and r2 connected through the inter-PE buffer, with numbered steps showing the order of the mini-operations for the East and West transfers]
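
The two mini-operations, and a possible reason the West transfer carries a NOP, can be sketched as below. This is a simplified model under stated assumptions: a 1-D array of PEs, instructions entering at PE 0 and arriving one cycle later at each PE to the east, and a reader that must execute strictly after its neighbor's write. None of these details are confirmed by the slide, and the function names are hypothetical.

```python
def xfer_east(pes, dst, src):
    """Apply the two mini-operations of `xfer dst src East` to a 1-D PE array."""
    # Mini-op 1: every PE copies `src` into the buffer on its east side.
    buffers = [pe[src] for pe in pes]  # buffers[i] sits between PE i and PE i+1
    # Mini-op 2: every PE reads the buffer on its west side and writes `dst`.
    for i in range(1, len(pes)):
        pes[i][dst] = buffers[i - 1]   # PE 0 has no west neighbor and is unchanged
    return pes


def exec_cycle(issue_cycle, pe_index):
    """Cycle at which a PE executes a mini-op, assuming one cycle of skew per PE."""
    return issue_cycle + pe_index


def min_issue_gap(direction, n_pes=8):
    """Smallest issue gap between the write and read mini-ops such that every
    reader executes strictly after its neighbor has written the buffer."""
    gap = 1
    while True:
        ok = True
        for writer in range(n_pes):
            reader = writer + 1 if direction == "East" else writer - 1
            if reader < 0 or reader >= n_pes:
                continue
            if exec_cycle(gap, reader) <= exec_cycle(0, writer):
                ok = False
        if ok:
            return gap
        gap += 1


pes = [{"r1": 10, "r2": 0}, {"r1": 20, "r2": 0}, {"r1": 30, "r2": 0}]
xfer_east(pes, "r2", "r1")
east_gap = min_issue_gap("East")  # mini-ops can issue back to back
west_gap = min_issue_gap("West")  # one idle slot (a NOP) is needed in between
```

In this model a transfer that moves with the instruction flow needs no padding, while a transfer against the flow needs one idle slot, which is consistent with the NOP shown for the West case.
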
23
Cycle Count Penalty for Systolic Instruction Broadcast
24
Impact of Breaking Long Global Wires
for 2-way CP ILP-SIMD
25
Architecture Today Depends on Fast Global
Communication
  • All instructions issued from single global
    instruction unit
  • All data passes through global register file
  • This won't work when global accesses cost 16
    clocks of latency (each way)

I-Unit
Regs
26
Clustered Architecture
  • Multiple elements (clusters) with
  • local instruction dispatch
  • local register files
  • co-located with arithmetic elements
  • Explicit communication between elements through a
    switch or network
  • Fast synchronization between instruction units

Sync
Switch
27
Multi-ALU Processor Chip
  • Exploits ILP and thread-level parallelism across
    clusters
  • Single cycle mechanisms
  • communication
  • synchronization
  • thread creation
  • Low-overhead inter-node mechanisms

28
Reduced Communication has Minimum Impact on
Performance
29
Register File Organization
  • Register files serve two functions
  • Short term storage for intermediate results
  • Communication between multiple function units
  • Which dominates area, delay, and power?
  • Global register files grow as N^3, where N is the
    number of ALUs
  • Need more registers to hold more results (grows
    with N)
  • Need more ports to connect all of the units
    (grows with N^2)
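
The cubic growth can be made concrete with a small relative-cost model. This is a sketch under simplifying assumptions: one register and one port per ALU, cell area proportional to the square of the port count (each port adds a word line and a bit line to every cell), and no cost charged for the switch between clusters. The function names are hypothetical.

```python
def central_rf_cost(n_alus):
    """Relative area of one register file shared by n_alus ALUs."""
    registers = n_alus        # more results to hold: grows with N
    cell_area = n_alus ** 2   # more ports per cell: grows with N^2
    return registers * cell_area


def clustered_rf_cost(n_alus, cluster_size):
    """Relative area when ALUs are split into clusters with private register files.

    Ignores the inter-cluster switch, so it understates the real cost.
    """
    clusters = n_alus // cluster_size
    return clusters * central_rf_cost(cluster_size)


central = central_rf_cost(48)          # 48 ** 3
clustered = clustered_rf_cost(48, 4)   # 12 clusters of 4 ALUs each
print(central, clustered, central / clustered)
```

The ratio printed here only shows the trend of the N^3 argument; it is not a reproduction of the measured figures quoted elsewhere in the presentation.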

30
Register Cells are Mostly Switch
31
SIMD and Distributed Register Files
Scalar
SIMD
Central
DRF
32
Organizations
  • 48 ALUs (32-bit), 500 MHz
  • Stream organization improves on the central
    organization: area 195x, delay 20x, power 430x

33
Performance
16% Performance Drop (8% with latency constraints)
180x Improvement
34
Hierarchical Register Organization
35
Much Locality is Data Dependent
  • Applications have data/time-dependent graph
    structure
  • Sparse-matrix solution
  • non-zero and fill-in structure
  • Logic simulation
  • circuit topology and activity
  • PIC codes
  • structure changes as particles move
  • Sort-middle polygon rendering
  • structure changes as viewpoint moves

36
Fine-Grain Data Migration: Drift and Diffusion
  • Run-time relocation based on pointer use
  • move data at both ends of pointer
  • move control and data
  • Each relocation cycle
  • compute drift vector based on pointer use
  • compute diffusion vector based on density
    potential (resource usage)
  • need to avoid oscillations
  • Should data be replicated?
  • not just update vs. invalidate
  • need to duplicate computation to avoid
    communication
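
One relocation cycle, as outlined above, might look like the following. This is a minimal sketch, not the presenters' algorithm: the 2-D coordinates, the use-count weighting, the central-difference density gradient, and the damping factor (a simple guard against oscillation) are all assumptions made for illustration.

```python
def drift_vector(pos, partner_positions, use_counts):
    """Pull toward data reached through pointers, weighted by how often each is used."""
    total = sum(use_counts)
    if total == 0:
        return (0.0, 0.0)
    dx = sum(c * (p[0] - pos[0]) for p, c in zip(partner_positions, use_counts)) / total
    dy = sum(c * (p[1] - pos[1]) for p, c in zip(partner_positions, use_counts)) / total
    return (dx, dy)


def diffusion_vector(pos, density):
    """Push away from crowded nodes: negative gradient of a density map.

    `density` maps (x, y) grid positions to resource usage; missing entries count as 0.
    """
    x, y = pos
    gx = density.get((x + 1, y), 0) - density.get((x - 1, y), 0)
    gy = density.get((x, y + 1), 0) - density.get((x, y - 1), 0)
    return (-gx / 2.0, -gy / 2.0)


def relocate(pos, drift, diffusion, damping=0.5):
    """Take a damped step along the combined vector to limit oscillation."""
    return (pos[0] + damping * (drift[0] + diffusion[0]),
            pos[1] + damping * (drift[1] + diffusion[1]))


drift = drift_vector((0, 0), [(4, 0), (0, 2)], [3, 1])
diffusion = diffusion_vector((1, 1), {(2, 1): 6, (0, 1): 2})
new_pos = relocate((0, 0), drift, diffusion)
```

Drift pulls data toward its most frequent communication partners, while diffusion spreads data away from overloaded nodes; the balance between the two is what the damping term tunes in this sketch.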

37
Results for Logic Simulation
38
Using Applications and Technology Models to
Explore Architectural Design Space
39
Power Efficiency Ranking
40
Most Power Efficient Configuration
41
Area Efficiency Ranking
42
Most Area Efficient Configuration
43
Best Overall Configuration
44
Improving Reliability, Availability, and
Scalability (RAS)
  • In 2007 CMOS technology, transient failures will
    regularly occur in logic latches and some
    combinatorial logic. Interconnect crosstalk will
    also cause significant transient failures. (IBM
    Jour. R&D)
  • Leverage multiple identical components in chip
    multiprocessor and speculative memory support to
    provide flexible RAS
  • Turn on RAS for higher reliability, turn off RAS
    for higher performance
  • Pair up processors within a Hydra quad
  • Processors compare results and retry if they
    disagree
  • Processors do not have to be in lockstep
  • Compare does not affect single thread performance
  • Temporary state is kept in speculative store
    buffers
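
The compare-and-retry idea above can be sketched in software. This is an illustrative model only, not the Hydra hardware mechanism: each "processor" is a Python callable that returns its buffered writes, and all names (`run_pair`, `good_cpu`, `FlakyCpu`) are hypothetical.

```python
def run_pair(task, state, cpu_a, cpu_b, max_retries=3):
    """Run `task` on two processors; commit only when their buffered writes agree.

    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        buf_a = cpu_a(task, dict(state))  # each run works on a private snapshot
        buf_b = cpu_b(task, dict(state))
        if buf_a == buf_b:
            state.update(buf_a)           # commit the speculative buffer
            return attempt
        # Disagreement: discard both buffers and retry from the unchanged state.
    raise RuntimeError("processors still disagree after retries")


def good_cpu(task, snapshot):
    """A processor that always computes the correct result."""
    return {"x": task(snapshot["x"])}


class FlakyCpu:
    """A processor that suffers one transient bit flip on its first run."""

    def __init__(self):
        self.calls = 0

    def __call__(self, task, snapshot):
        self.calls += 1
        result = task(snapshot["x"])
        return {"x": result ^ 1 if self.calls == 1 else result}


state = {"x": 5}
retries = run_pair(lambda v: v + 1, state, good_cpu, FlakyCpu())
```

Because uncommitted results live only in the private buffers, a mismatch leaves the architectural state untouched and the retry starts from a clean point.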

45
Hydra Speculation Support
  • Speculation coprocessor to sequence speculative
    threads
  • Additional L1 D cache bits to track speculative
    reads and writes
  • Write buffers at L2 cache to hold speculative
    writes
  • Write bus has additional bits to support
    speculative writes

46
RAS Design
  • Compares happen
  • 100 K insts with 32 line L2 buffer

[Figure: Processors 1-4, each with its own Regs and L2 Buffers, two Error signals, and a shared L2 Cache]
47
Task 1 Objectives Summary
  • Interface with other tasks to improve the
    performance of available interconnect
  • Employ global wire connect architectures to
    maximize the utilization of global interconnect
  • Explore new architectures to reduce the
    requirement for global interconnect