1
A Distributed Control Path Architecture for VLIW Processors
  • Hongtao Zhong, Kevin Fan, Scott Mahlke, and Michael Schlansker
  • Advanced Computer Architecture Laboratory
  • University of Michigan
  • HP Laboratories

2
Motivation
  • VLIW Scaling Problem
  • Centralized resource
  • Highly ported structures
  • Wire delays

[Figure: register file, function units (FU), instruction fetch/decode]
3
Multicluster VLIW
  • Distribute register files
  • Cluster function units
  • Distribute data caches
  • Clusters communicate through interconnection
    network
  • Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC

[Figure: Cluster 0 and Cluster 1 (register file, FUs), interconnection network, instruction fetch/decode]
4
Control Path Scaling Problem
  • Larger I-cache
  • Latency
  • Long wires for control signals distribution
  • Code compression
  • Hardware cost, power
  • Grow quadratically with the number of FUs

[Figure: PC, I-cache (operations A to G, X), align/shift network, IR (A, B, NOP, NOP)]
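As a rough illustration of the code compression mentioned on this slide, the sketch below drops NOPs from fixed-width MultiOps for storage and reinserts them at fetch time, which is the role the align/shift network plays in hardware. The mask-based encoding, the 4-slot width, and the function names are assumptions made for illustration; the slides do not give the actual encoding.

```python
NOP = "NOP"

def compress(multiop):
    """Encode one MultiOp as (slot mask, list of non-NOP ops)."""
    mask = [op != NOP for op in multiop]
    ops = [op for op in multiop if op != NOP]
    return mask, ops

def expand(mask, ops):
    """Rebuild the full-width MultiOp, reinserting NOPs (the align/shift step)."""
    it = iter(ops)
    return [next(it) if used else NOP for used in mask]

# Hypothetical 4-wide program; operation names are placeholders.
program = [
    ["A", "B", NOP, NOP],
    ["C", NOP, NOP, NOP],
    ["D", "E", "F", "G"],
]

compressed = [compress(m) for m in program]
stored_ops = sum(len(ops) for _, ops in compressed)   # ops kept in the I-cache
restored = [expand(mask, ops) for mask, ops in compressed]
```

The wider the machine, the more slots the expansion step has to route each stored operation to, which is one way to picture why the cost of this logic rises with the number of FUs.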
5
Straightforward Approach
  • Distribute I-fetch in spirit similar to
    distribution of data path
  • Local communication of controls
  • Reduce latency, hardware cost, power
  • Used in Multiflow Trace 14/300 processors

[Figure: interconnection network, register files, FUs, IR, PC, I-cache]
6
DVLIW Approach
  • Simple distribution has problems
  • Doesn't support code compression
  • PC still a centralized resource

[Figure: interconnection network, register files, FUs, IR, align/shift, PC, PC0, PC1, I-cache]
7
DVLIW Execution Model
  • Clusters execute in lock-step
  • When one cluster stalls, all clusters stall
  • Clusters collectively execute one thread
  • Each cluster runs an instruction stream
  • Compiler orchestrates the execution of streams
  • Compiler manages communication
  • Lightweight synchronization
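A minimal sketch of the lock-step rule, assuming each cluster's instruction stream is a list of operations and stalls are supplied as a hypothetical per-cycle table: if any cluster stalls in a cycle, no cluster advances.

```python
def run_lockstep(streams, stalls):
    """Advance all cluster streams together.

    streams: one list of operations per cluster, all the same length.
    stalls:  dict mapping cycle number -> set of cluster ids stalling then
             (a hypothetical stand-in for events such as cache misses).
    Returns the per-cycle trace of what each cluster issued (None = stalled).
    """
    length = len(streams[0])
    assert all(len(s) == length for s in streams)
    index, cycle, trace = 0, 0, []
    while index < length:
        if stalls.get(cycle):
            # One cluster stalling freezes every cluster this cycle.
            trace.append([None] * len(streams))
        else:
            trace.append([s[index] for s in streams])
            index += 1
        cycle += 1
    return trace

trace = run_lockstep(
    streams=[["a0", "a1", "a2"], ["b0", "b1", "b2"]],
    stalls={1: {1}},  # cluster 1 stalls in cycle 1
)
```

Because the streams never drift apart, the compiler can schedule communication between clusters statically instead of relying on run-time handshakes.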

8
DVLIW Benefits
  • Completely decentralized architecture
  • Distributed data path
  • Distributed control path
  • Supports arbitrary code compression
  • Exploiting ILP on multi-core style system
  • Good for embedded applications
  • Low cost
  • Compiler support

9
DVLIW Architecture
[Figure: four VLIW clusters (0 to 3) with links between clusters and to a banked L2; cluster detail labels: L1 I-cache, PC, next PC, br_target, align/shift, IR, register files, FU, MFU, L1 D-cache]
10
Code Organization
  • Code for each cluster is consecutive in memory
  • Operations in the same MultiOp stored in
    different memory locations
  • Each cluster computes its own next PC

[Figure: code layout in a conventional VLIW (PC) and in DVLIW (PC0, PC1)]
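The layout described on this slide can be sketched as follows. Each cluster's slice of every MultiOp is placed in that cluster's own consecutive region of memory, and each cluster steps its own PC by the size of its own slice. The word-per-operation sizing, the one-word cost of an empty slice, and the function name are assumptions for illustration only.

```python
def layout(multiops, n_clusters, base=0):
    """Place each cluster's slice of every MultiOp consecutively in memory.

    multiops: list of MultiOps; each holds one list of ops per cluster.
    Returns, per cluster, the start address of its slice of each MultiOp.
    Assumption: a slice occupies one word per op, and one word when empty.
    """
    addresses = []
    addr = base
    for c in range(n_clusters):
        pcs = []
        for m in multiops:
            pcs.append(addr)
            addr += max(1, len(m[c]))  # this cluster's own next-PC increment
        addresses.append(pcs)
    return addresses

multiops = [
    [["A", "B"], ["C"]],           # MultiOp 0: cluster 0 has 2 ops, cluster 1 has 1
    [["D"], []],                   # MultiOp 1: cluster 1 has nothing to do
    [["E"], ["F", "G", "H"]],      # MultiOp 2
]
pc = layout(multiops, n_clusters=2)
```

In this toy layout the two PCs hold different values for the same logical MultiOp, which is why a single shared PC no longer suffices.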
11
Branch Mechanism
  • Maintain correct execution order
  • All clusters transfer control at the same cycle
  • All clusters branch to the same logical multiop
  • Unbundled branch in HPL-PD

Branch, unbundled into three operations:
PBR  btr1, TARGET      (each cluster specifies its own target)
CMPP pr0, (x100)?      (broadcast to all clusters)
BR   btr1, pr0         (replicated in each cluster)
12
Branch Handling Example
Conventional VLIW:
  pbr  btr1, BB2
  cmpp pr0, (x100)?
  br   btr1, pr0

DVLIW (Cluster 0 and Cluster 1), one stream per cluster:
  cluster computing the predicate    other cluster
  pbr   btr1, BB2                    pbr btr1, BB2
  cmpp  pr0, (x100)?                 .
  bcast pr0                          .
  br    btr1, pr0                    br  btr1, pr0
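A small sketch of the branch sequence above, under stated assumptions: the `Cluster` class, the block addresses, and the choice of cluster 0 as the one that evaluates the compare are all hypothetical. The compare condition is unreadable in the transcript, so `x > 100` stands in for it.

```python
class Cluster:
    def __init__(self, block_addr):
        self.block_addr = block_addr  # cluster-local address of each block
        self.btr = {}                 # branch target registers
        self.pr = {}                  # predicate registers
        self.pc = block_addr["BB1"]

    def pbr(self, btr, block):
        # Each cluster resolves the same logical target to its own address.
        self.btr[btr] = self.block_addr[block]

    def br(self, btr, pr, fallthrough):
        self.pc = self.btr[btr] if self.pr[pr] else fallthrough

def bcast(source, clusters, pr):
    """Copy a predicate computed in one cluster to every cluster."""
    for c in clusters:
        c.pr[pr] = source.pr[pr]

def run_branch(x):
    c0 = Cluster({"BB1": 0, "BB2": 40, "BB3": 12})
    c1 = Cluster({"BB1": 100, "BB2": 160, "BB3": 108})
    for c in (c0, c1):
        c.pbr("btr1", "BB2")
    c0.pr["pr0"] = x > 100            # cmpp on cluster 0 (assumed condition)
    bcast(c0, (c0, c1), "pr0")
    # The replicated br executes in every cluster in the same cycle.
    c0.br("btr1", "pr0", fallthrough=c0.block_addr["BB3"])
    c1.br("btr1", "pr0", fallthrough=c1.block_addr["BB3"])
    return c0.pc, c1.pc
```

Both clusters end up at their own copy of the same logical block, taken or not taken, even though the two PC values differ.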
13
Sleep Mode
  • Idle blocks after distribution
  • Put cluster into sleep mode
  • Compiler managed
  • Save energy
  • Reduce code size
  • Mode change happens at block boundary

[Figure: Cluster 0 and Cluster 1 instruction streams showing SLEEP, BR, and WAKE operations]
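One way a compiler might find sleep candidates is to look for runs of consecutive blocks in which a cluster has no useful operations. The sketch below is a hypothetical pass, not the authors' algorithm; the function name and the per-block operation counts are made up for illustration.

```python
def plan_sleep(ops_per_block):
    """Return (first, last) index pairs for runs of blocks with no useful ops.

    ops_per_block: number of useful operations one cluster has in each
    block along a path. Runs start and end at block boundaries.
    """
    runs, start = [], None
    for i, n in enumerate(ops_per_block):
        if n == 0 and start is None:
            start = i                     # an idle run begins at this block
        elif n != 0 and start is not None:
            runs.append((start, i - 1))   # the run ended at the previous block
            start = None
    if start is not None:
        runs.append((start, len(ops_per_block) - 1))
    return runs

idle_runs = plan_sleep([3, 0, 0, 0, 2, 0])
```

Each run is a span in which the cluster could be put to sleep at the first block boundary and woken at the last, so its code for those blocks need not be stored or fetched.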
14
Experimental Setup
  • Trimaran toolset
  • Processor configuration
  • 4 clusters, each with 2 INT, 1 FP, 1 MEM, and 1 BR unit
  • 16K L1 I-cache total
  • Perfect data cache assumed
  • Power Model
  • Verilog for instruction align/shift logic
  • Wire model
  • Cacti cache model
  • 21 benchmarks from MediaBench and SPECINT2000

15
Change in Global Communication Bits
[Chart: results for MediaBench and SPECINT benchmarks]
16
Normalized Energy Consumption on Control Path
Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)
[Chart annotations: 40% saving, 67% saving, 80% saving, 21% saving]
17
Normalized Code Size
  • Baseline: conventional VLIW with compressed encoding
  • Traditional method (single PC): 7x increase
  • DVLIW: 40% increase
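The two code-size figures can be related with a line of arithmetic. This sketch reads "7x increase" as seven times the baseline size and the DVLIW figure as a 40% increase; both readings are assumptions about how the slide's numbers are meant.

```python
baseline = 1.0                      # conventional VLIW, compressed encoding
single_pc = 7.0 * baseline          # simple distribution with a single PC
dvliw = (1 + 0.40) * baseline       # DVLIW
ratio = single_pc / dvliw           # how much smaller DVLIW code is
print(f"DVLIW code is about {ratio:.1f}x smaller than simple distribution")
```

Under these readings the ratio works out to 5, matching the 5x code size reduction quoted in the result summary.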
18
Result Summary
  • DVLIW benefits
  • Order of magnitude reduction in global
    communication
  • 40% savings in control path energy
  • 5x code size reduction vs. simple distribution
  • Small overhead for ILP execution on CMP
  • 3% increase in execution cycles
  • 4% increase in I-cache stalls

19
Conclusions
  • DVLIW removes last centralized resource in a
    multicluster VLIW
  • Fully distributed control path
  • Scalable architecture
  • More energy efficient
  • Stylized CMP architecture
  • Exploit ILP
  • Multiple instruction streams
  • Compiler orchestrated

20
Thank You
  • For more information
  • http://cccp.eecs.umich.edu