Provided by: hongta

Learn more at: https://cccp.eecs.umich.edu

1
A Distributed Control Path Architecture for VLIW Processors

Hongtao Zhong, Kevin Fan, Scott Mahlke,
and Michael Schlansker
Advanced Computer Architecture Laboratory
University of Michigan
HP Laboratories

2
Motivation

VLIW Scaling Problem
Centralized resource
Highly ported structures
Wire delays

[Figure: VLIW datapath showing register files, FUs, and instruction fetch/decode]

3
Multicluster VLIW

Distribute register files
Cluster function units
Distribute data caches
Clusters communicate through an interconnection network
Used in TI C6x, Lx/ST200, Analog Devices TigerSHARC

[Figure: multicluster VLIW showing Cluster 0 and Cluster 1, each with a register file and FUs, joined by an interconnection network, with instruction fetch/decode]

4
Control Path Scaling Problem

Larger I-cache
Latency
Long wires for control signal distribution
Code compression
Hardware cost, power
Grow quadratically with the number of FUs

[Figure: control path with PC, I-cache, align/shift network, and IR holding operations and NOPs]

5
Straightforward Approach

Distribute I-fetch, in a spirit similar to the distribution of the data path
Local communication of controls
Reduce latency, hardware cost, power
Used in Multiflow Trace 14/300 processors

[Figure: clusters with register files and FUs joined by interconnection networks, with I-caches, IRs, and PC]

6
DVLIW Approach

Simple distribution has problems
Doesn't support code compression
PC still a centralized resource

[Figure: clusters with register files and FUs joined by interconnection networks; each cluster has its own I-cache, align/shift logic, IR, and PC (PC0, PC1)]

7
DVLIW Execution Model

Clusters execute in lock-step
When one cluster stalls, all clusters stall
Clusters collectively execute one thread
Each cluster runs an instruction stream
Compiler orchestrates the execution of streams
Compiler manages communication
Lightweight synchronization
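
The lock-step rule above can be illustrated with a minimal Python sketch. It is not taken from the slides: the function name run_lockstep, the stalls table, and the operation names are all hypothetical, and a stall is modeled simply as a number of wait cycles before a given MultiOp.

```python
from typing import Dict, List


def run_lockstep(streams: List[List[str]],
                 stalls: List[Dict[int, int]]) -> List[List[str]]:
    """Return a per-cycle trace of what every cluster does.

    streams[c][i] is the operation cluster c issues for MultiOp i.
    stalls[c] maps a MultiOp index to the stall cycles cluster c needs
    before issuing it (a hypothetical stand-in for, e.g., an I-cache miss).
    """
    n_clusters = len(streams)
    n_multiops = len(streams[0])
    assert all(len(s) == n_multiops for s in streams)
    trace = []
    for i in range(n_multiops):
        # Lock-step: the slowest cluster decides how long everyone waits.
        wait = max(stalls[c].get(i, 0) for c in range(n_clusters))
        trace.extend([["stall"] * n_clusters for _ in range(wait)])
        trace.append([streams[c][i] for c in range(n_clusters)])
    return trace


streams = [["A0", "B0", "C0"], ["A1", "B1", "C1"]]
stalls = [{1: 2}, {}]  # only cluster 0 stalls, for 2 cycles before B0
trace = run_lockstep(streams, stalls)
```

In this sketch cluster 1 never stalls on its own, yet it idles for the same two cycles as cluster 0, so both streams stay aligned on the same MultiOp.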

8
DVLIW Benefits

Completely decentralized architecture
Distributed data path
Distributed control path
Supports arbitrary code compression
Exploits ILP on a multicore-style system
Good for embedded applications
Low cost
Compiler support

9
DVLIW Architecture
[Figure: four VLIW clusters (VLIWCluster 0 through 3) with links to neighboring clusters and to a banked L2. Labels inside a cluster: L1 I-Cache, PC, Next PC, align/shift, IR (A, B, NOP), br_target, Register Files, FU, MFU, IC, L1 D-Cache.]

10
Code Organization
[Figure panels: conventional VLIW and DVLIW code layouts]

Code for each cluster is consecutive in memory
Operations in the same MultiOp are stored in different memory locations
Each cluster computes its own next PC
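
The layout described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the encoding used by the authors: the function names, the base-address table, and the operation names are hypothetical, and NOPs are kept explicit (no compression) so that each cluster's PC simply advances by one per MultiOp.

```python
from typing import List, Tuple

# Each MultiOp has one slot per cluster; "NOP" marks an empty slot.
multiops = [
    ["A0", "A1"],
    ["B0", "NOP"],
    ["NOP", "C1"],
]


def conventional_layout(mops: List[List[str]]) -> List[str]:
    """Single stream: the operations of one MultiOp sit next to each other."""
    return [op for mop in mops for op in mop]


def dvliw_layout(mops: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Per-cluster streams: each cluster's code is consecutive in memory.

    Returns the memory image and the base address of each cluster's stream.
    """
    n_clusters = len(mops[0])
    memory: List[str] = []
    base: List[int] = []
    for c in range(n_clusters):
        base.append(len(memory))
        memory.extend(mop[c] for mop in mops)
    return memory, base


def fetch(memory: List[str], base: List[int], i: int) -> List[str]:
    """Each cluster uses its own PC (base[c] + i) to fetch MultiOp i."""
    return [memory[b + i] for b in base]


memory, base = dvliw_layout(multiops)
```

Fetching MultiOp 1 reads two addresses that are far apart in memory, one per cluster, which is the sense in which operations of the same MultiOp live in different memory locations.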

[Figure labels: PC (conventional VLIW); PC0 and PC1 (DVLIW)]

11
Branch Mechanism

Maintain correct execution order
All clusters transfer control in the same cycle
All clusters branch to the same logical MultiOp
Unbundled branch in HPL-PD

Each cluster specifies its own target
Branch
PBR btr1, TARGET: each cluster specifies its own target
CMPP pr0, (x100)?: broadcast to all clusters
BR btr1, pr0: replicated in each cluster

12
Branch Handling Example
Conventional VLIW:
  pbr btr1, BB2
  cmpp pr0, (x100)?
  br btr1, pr0
DVLIW, cluster that computes the predicate:
  pbr btr1, BB2
  cmpp pr0, (x100)?
  bcast pr0
  br btr1, pr0
DVLIW, other cluster:
  pbr btr1, BB2
  .
  .
  br btr1, pr0
[Figure labels: Cluster 0, Cluster 1]
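
The branch sequence above can be mimicked with a small Python model. It is a sketch under assumptions: the Cluster class, the bcast helper, and the target addresses 40 and 112 are hypothetical, and the compare result is passed in as a boolean because the comparison on the slide is not legible.

```python
class Cluster:
    """Hypothetical model of one cluster's branch state."""

    def __init__(self, targets):
        self.targets = dict(targets)  # label -> address in this cluster's stream
        self.btr = {}                 # branch target registers
        self.pr = {}                  # predicate registers
        self.pc = 0

    def pbr(self, btr, label):
        # Prepare-to-branch: each cluster loads its own address for the label.
        self.btr[btr] = self.targets[label]

    def cmpp(self, pr, value):
        # Compare-to-predicate, with the comparison result supplied directly.
        self.pr[pr] = bool(value)

    def br(self, btr, pr):
        # Taken: jump to the local target. Not taken: fall through.
        self.pc = self.btr[btr] if self.pr[pr] else self.pc + 1


def bcast(source, clusters, pr):
    """Copy the predicate from the computing cluster to every cluster."""
    for cluster in clusters:
        cluster.pr[pr] = source.pr[pr]


def run_branch(condition):
    c0 = Cluster({"BB2": 40})    # BB2 starts at 40 in cluster 0's stream
    c1 = Cluster({"BB2": 112})   # and at 112 in cluster 1's stream
    for c in (c0, c1):
        c.pbr("btr1", "BB2")
    c0.cmpp("pr0", condition)    # only one cluster evaluates the compare
    bcast(c0, (c0, c1), "pr0")
    for c in (c0, c1):
        c.br("btr1", "pr0")      # replicated branch, same cycle
    return c0.pc, c1.pc
```

Calling run_branch(True) leaves the two PCs at different addresses, but both point at the start of the same logical block, which is what keeps the streams in the same logical MultiOp.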
13
Sleep Mode

Idle blocks after distribution
Put cluster into sleep mode
Compiler managed
Save energy
Reduce code size
Mode change happens at block boundary

[Figure: Cluster 0 and Cluster 1 instruction streams with BR, SLEEP, and WAKE operations]

14
Experimental Setup

Trimaran toolset
Processor configuration
4 clusters, 2 INT, 1 FP, 1 MEM, 1 BR per cluster
16K L1 I-cache total
Perfect data cache assumed
Power Model
Verilog for instruction align/shift logic
Wire model
Cacti cache model
21 benchmarks from MediaBench and SPECINT2000

15
Change in Global Communication Bits
[Chart panels: MediaBench, SPECINT]

16
Normalized Energy Consumption on Control Path
Control path energy = (align/shift logic energy) + (wire energy) + (I-cache energy)
[Chart labels: 40% saving, 67% saving, 80% saving, 21% saving]

17
Normalized Code Size
Baseline: conventional VLIW with compressed encoding
Traditional method (single PC): 7x increase
DVLIW: 40% increase
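
As a quick consistency check on these figures, the short Python calculation below relates the two code-size increases to each other. It assumes the DVLIW figure is a 40% increase over the baseline (the percent sign appears to be missing from the transcript) and normalizes the baseline to 1; the variable names are illustrative only.

```python
from fractions import Fraction

baseline = Fraction(1)                          # compressed conventional VLIW
single_pc = 7 * baseline                        # traditional method: 7x increase
dvliw = baseline * (1 + Fraction(40, 100))      # assumed 40% increase

# How much smaller DVLIW code is than the single-PC distributed code.
reduction = single_pc / dvliw
```

Under these assumptions reduction comes out to exactly 5, matching a 5x code size reduction relative to simple distribution.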
18
Result Summary

DVLIW benefits
Order of magnitude reduction in global
communication
40% savings in control path energy
5x code size reduction vs. simple distribution
Small overhead for ILP execution on CMP
3% increase in execution cycles
4% increase in I-cache stalls

19
Conclusions

DVLIW removes the last centralized resource in a multicluster VLIW
Fully distributed control path
Scalable architecture
More energy efficient
Stylized CMP architecture
Exploit ILP
Multiple instruction streams
Compiler orchestrated
