Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache

Provided by: egibertc
Learn more at: http://people.ac.upc.edu

1
Local Scheduling Techniques for Memory Coherence in a Clustered VLIW Processor with a Distributed Data Cache
  • Enric Gibert (1)
  • Jesús Sánchez (2)
  • Antonio González (1,2)

(1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona
2
Motivation
  • Capacity-bound vs. communication-bound
  • Clustered microarchitectures
  • Simpler and faster
  • Power consumption
  • Communications not homogeneous
  • Clustering → embedded/DSP domain

3
Clustered Microarchitectures
4
Contributions
  • Distribution of data cache
  • Architecture design + data mapping
  • Word-interleaved scheme [ICS'02]
  • Appropriate scheduling techniques [MICRO'02]
  • Memory coherence
  • Scheduling techniques for memory coherence
  • Local software-based techniques
  • Applied to word-interleaved cache
  • Complex configuration (with Attraction Buffers; refer to
    the paper)
  • Simple configuration (without Attraction Buffers)
  • Applicable to any other cache configuration

5
Talk Outline
  • Architecture and Scheduling Algorithms
  • Memory Coherence Problem
  • Solutions
  • Memory Dependent Chains (MDC)
  • DDG Transformations (DDGT)
  • Evaluation
  • Conclusions

6
Word-Interleaved Distribution
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with its own functional units, register file and cache module, connected by register-to-register communication buses and backed by an L2 cache. The words of a cache block are interleaved across the cache modules, each of which keeps its own TAG: module 1 holds W0 and W4, module 2 holds W1 and W5, module 3 holds W2 and W6, module 4 holds W3 and W7.]
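
The mapping in the figure can be sketched in a few lines. This is a minimal illustration, assuming 4-byte words, four clusters numbered from 1, and consecutive words assigned round-robin to consecutive cache modules. The names WORD_BYTES, NUM_CLUSTERS and home_cluster are hypothetical and do not come from the talk.

```python
WORD_BYTES = 4      # assumed interleaving factor (the talk uses 2 or 4 bytes)
NUM_CLUSTERS = 4    # four clusters, as in the figure


def home_cluster(byte_addr: int) -> int:
    """Return the 1-based cluster whose cache module holds this byte address."""
    word_index = byte_addr // WORD_BYTES
    return word_index % NUM_CLUSTERS + 1


# Words W0..W7 of one 32-byte cache block, assumed to start at address 0.
block = {f"W{w}": home_cluster(w * WORD_BYTES) for w in range(8)}
```

Under these assumptions, W0 and W4 land in cluster 1 and W3 and W7 in cluster 4, matching the figure.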
7
Scheduling Techniques
[Figure: array elements a0..a7 interleaved across the four cache modules: CLUSTER 1 holds a0 and a4, CLUSTER 2 holds a1 and a5, CLUSTER 3 holds a2 and a6, CLUSTER 4 holds a3 and a7.]
  • Modulo scheduling
  • Loop unrolling
  • Assignment of latencies
  • Padding
  • Profiling

Original loop:
    for (i = 0; i < MAX; i++) {
        ld r3, a[i]
        r4 = OP(r3)
        st r4, b[i]
    }

Unrolled loop:
    for (i = 0; i < MAX; i += 4) {
        ld r31, a[i]      (stride 16 bytes)
        ld r32, a[i+1]    (stride 16 bytes)
        ld r33, a[i+2]    (stride 16 bytes)
        ld r34, a[i+3]    (stride 16 bytes)
        ...
    }
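
Why unrolling helps can be checked with a small sketch. It assumes 4-byte words, four clusters, an array whose first element is aligned to cluster 1 (for example through padding), and an unroll factor equal to the number of clusters. All names are hypothetical.

```python
WORD_BYTES = 4
NUM_CLUSTERS = 4
UNROLL = 4          # assumed unroll factor, equal to the number of clusters


def home_cluster(byte_addr):
    """1-based cluster holding the word at this byte address."""
    return (byte_addr // WORD_BYTES) % NUM_CLUSTERS + 1


def clusters_touched(base_addr, offset, step, max_iters):
    """Clusters accessed by the static load a[i + offset] for i = 0, step, 2*step, ..."""
    return {home_cluster(base_addr + (i + offset) * WORD_BYTES)
            for i in range(0, max_iters, step)}


# Original loop: the single load a[i] visits every cluster over time.
rolled = clusters_touched(base_addr=0, offset=0, step=1, max_iters=64)

# Unrolled loop: each of the four loads has a 16-byte stride and one home cluster.
unrolled = [clusters_touched(0, k, UNROLL, 64) for k in range(UNROLL)]
```

With the stride fixed at 16 bytes, each static load always touches the same cache module, so the scheduler can place it in the cluster that owns its data.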
8
Cluster Assignment
  • Non-memory instructions
  • Minimize register communications
  • Maximize workload balance
  • Memory instructions → 2 heuristics
  • PrefClus heuristic
  • Preferred cluster = most accessed cluster
  • Profiling + padding
  • MinComs heuristic
  • Minimize register communications
  • Maximize workload balance
  • Post-pass phase to increase local accesses
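
The "preferred cluster" idea behind PrefClus can be sketched as follows. This is only an illustration of picking the most accessed cluster from a profile, not the authors' implementation. It assumes 4-byte words and four clusters, and the profile data is made up.

```python
from collections import Counter

WORD_BYTES = 4
NUM_CLUSTERS = 4


def home_cluster(byte_addr):
    """1-based cluster holding the word at this byte address."""
    return (byte_addr // WORD_BYTES) % NUM_CLUSTERS + 1


def preferred_cluster(profiled_addrs):
    """Most frequently accessed cluster in a profile run of one memory instruction."""
    counts = Counter(home_cluster(a) for a in profiled_addrs)
    cluster, _ = counts.most_common(1)[0]
    return cluster


# Hypothetical profile: a 16-byte-stride load starting at byte 8, plus one stray access.
profile = [8, 24, 40, 56, 4]
pref = preferred_cluster(profile)
```

Here four of the five profiled accesses fall in cluster 3, so that cluster would be chosen for the instruction.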

10
Memory Coherence Problem
[Figure: the cache modules of CLUSTER 1 and CLUSTER 4 connect through the memory buses to the NEXT MEMORY LEVEL. Remote accesses, misses, replacements and other traffic share these buses, so the bus latency is NON-DETERMINISTIC.]

Schedule:
cycle   CLUSTER 1       CLUSTER 2       CLUSTER 3       CLUSTER 4
i       -               -               -               store to a0
i+1     -               -               -               -
i+2     -               -               -               -
i+3     -               -               -               -
i+4     load from a0    -               -               -
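
The hazard in this schedule can be shown with a toy timing model. It assumes that a0 lives in the cache module of cluster 1, that a store becomes visible in the home module a fixed number of cycles after issue, and that the latencies are the local-hit and remote-hit values from the evaluation setup (taken here to be cycles). The function name and the bus_delay parameter are hypothetical.

```python
LOCAL_LATENCY = 1    # local hit latency from the evaluation setup (assumed cycles)
REMOTE_LATENCY = 5   # remote hit latency; bus contention can only make it longer


def value_seen_by_load(store_cycle, load_cycle, store_is_remote, bus_delay=0):
    """Return 'new' if the store reached the home module before the local load reads it."""
    latency = REMOTE_LATENCY + bus_delay if store_is_remote else LOCAL_LATENCY
    store_arrival = store_cycle + latency
    return "new" if store_arrival <= load_cycle else "stale"


# Store issued in cluster 4 at cycle i = 0, load issued in cluster 1 at cycle i + 4.
remote_case = value_seen_by_load(store_cycle=0, load_cycle=4, store_is_remote=True)

# Same schedule, but with the store placed in cluster 1 as well.
local_case = value_seen_by_load(store_cycle=0, load_cycle=4, store_is_remote=False)
```

In this model the remote store arrives after the local load has already read a0, so the load sees a stale value. Because the bus delay is not known at compile time, the scheduler cannot simply pad the schedule with a fixed number of cycles.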
12
Solutions Outline
  • Local scheduling solutions → applied at a loop
    granularity
  • Memory Dependent Chains (MDC)
  • Data Dependence Graph Transformations (DDGT)
  • Store replication
  • Load-store synchronization
  • Software-based solutions
  • Applicable to other configurations
  • Replicated distributed cache
  • MultiVLIW [MICRO'00]

13
Memory Dependent Chains
  • Sets of aliased instructions
  • Memory Dependent Chains (MDC)
  • Instructions in the same set
  • Assigned to the same cluster
  • Restrictions on cluster assignment
  • PrefClus: average preferred cluster
  • MinComs: minimize communications when scheduling the first node

[Figure: example data dependence graph with nodes n1 (load), n2 (load), n3 (add), n4 (store), n6 (load), n7 (div) and n8 (add), connected by MF, MA and RF edges. Legend: MF = memory-flow, MA = memory-anti, RF = register-flow.]
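
Grouping memory instructions into memory dependent chains amounts to finding connected components over the memory dependence edges. The sketch below does this with a small union-find. It is an illustration under assumptions: the edge list is hypothetical, because the exact edges of the figure cannot be recovered from the transcript, and the function name is made up.

```python
def memory_dependent_chains(mem_nodes, mem_edges):
    """Group memory instructions connected by memory dependences (MF, MA, MO)."""
    parent = {n: n for n in mem_nodes}

    def find(n):
        # Follow parent links to the representative, halving the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for src, dst, _kind in mem_edges:
        parent[find(src)] = find(dst)

    chains = {}
    for n in mem_nodes:
        chains.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in chains.values())


# Hypothetical memory dependences between the loads and the store of the example.
nodes = ["n1", "n2", "n4", "n6"]
edges = [("n1", "n4", "MA"), ("n4", "n2", "MF")]
chains = memory_dependent_chains(nodes, edges)
```

All instructions in one chain would then be assigned to the same cluster, while an instruction with no memory dependences (n6 in this made-up edge list) stays free.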
14
Memory Dependent Chains
[Figure: the cache modules of CLUSTER 1 and CLUSTER 4 connect through the memory buses to the NEXT MEMORY LEVEL.]

Schedule:
cycle   CLUSTER 1       CLUSTER 2       CLUSTER 3       CLUSTER 4
i       -               -               -               store to a0
i+1     -               -               -               -
i+2     -               -               -               -
i+3     -               -               -               -
i+4     load from a0    -               -               -
15
DDGT Store Replication
  • Overcome MEM_FLOW (MF) and MEM_OUT (MO)

[Figure: store replication, two cases. MF case: store A has an MF dependence to load B; after replication there is one instance of store A per cluster. MO case: store A has an MO dependence to store B; after replication there is one instance of each store per cluster.]
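
As a rough sketch of the graph transformation, the function below replaces a store node with one replica per cluster and rewires its edges. It rests on assumptions the transcript does not confirm: each replica is pinned to a distinct cluster, and every original edge of the store is conservatively kept for every replica (the actual technique may keep fewer edges). Node names and the function name are hypothetical.

```python
def replicate_store(edges, store, num_clusters):
    """Replace `store` by one pinned replica per cluster and rewire its edges.

    edges: list of (src, dst, kind) tuples describing the DDG.
    Returns (replicas, new_edges), where replicas maps replica name -> cluster.
    """
    replicas = {f"{store}@c{c}": c for c in range(1, num_clusters + 1)}
    new_edges = []
    for src, dst, kind in edges:
        if src == store:
            # Consumers of the store now depend on every replica (conservative).
            new_edges.extend((r, dst, kind) for r in replicas)
        elif dst == store:
            # Producers must now feed every replica.
            new_edges.extend((src, r, kind) for r in replicas)
        else:
            new_edges.append((src, dst, kind))
    return replicas, new_edges


# Hypothetical DDG: an add feeds store_A, and load_B is MF-dependent on store_A.
ddg = [("add", "store_A", "RF"), ("store_A", "load_B", "MF")]
replicas, new_ddg = replicate_store(ddg, "store_A", num_clusters=4)
```

The sketch also makes the cost visible: the value to be stored has to reach every cluster, which means more instructions and more register communications.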
16
DDGT Store Replication
[Figure: the cache modules of CLUSTER 1 and CLUSTER 4 connect through the memory buses to the NEXT MEMORY LEVEL.]

Schedule:
cycle   CLUSTER 1       CLUSTER 2       CLUSTER 3       CLUSTER 4
i       -               -               -               store to a0
i+1     store to a0     -               store to a0     -
i+2     -               -               -               -
i+3     -               store to a0     -               -
i+4     load from a0    -               -               -
17
DDGT ld-st Synchronization
  • Overcome MEM_ANTI (MA) dependences
  • Special cases
  • Store is already REG_FLOW dependent on the load
  • Impossible recurrences

18
MDC Solution Case Study
  • Impact on compute time
  • May increase the IIres

[Figure: example with load A, load B and store C, linked by MA and MF memory dependences.]
  • Impact on stall time
  • May increase remote accesses
  • Extra stall cycles: 3 cycles / iteration

19
DDGT Solution Case Study
  • Impact on compute time
  • More instructions (IIres)
  • Store replication
  • Fake consumers (few)
  • Register communications

[Figure: example with load A, a set of memory instructions X and store B, linked by MA and MF memory dependences.]
  • Impact on stall time
  • Small
  • New dependences may decrease slack of some memory
    instructions

21
Evaluation Framework
  • IMPACT C compiler
  • Compile + optimize + memory disambiguation
  • Mediabench benchmark suite

Benchmark    Profile input    Execution input
epicdec      test_image       titanic
g721dec      clinton          S_16_44
g721enc      clinton          S_16_44
gsmdec       clinton          S_16_44
gsmenc       clinton          S_16_44
jpegdec      testimg          monalisa
jpegenc      testimg          monalisa
mpeg2dec     mei16v2          tek6
pegwitdec    pegwit           techrep
pegwitenc    pgptest          techrep
pgpdec       pgptext          techrep
pgpenc       pgptest          techrep
rasta        ex5_c1           ex5_c1
22
Evaluation Framework
Word-Interleaved Cache Clustered VLIW Processor

Number of clusters      4
Functional units        1 FP, 1 integer and 1 memory unit per cluster
Register buses          4 buses running at ½ the core frequency
Memory buses            4 buses running at ½ the core frequency
Cache configuration     8 KB, 2-way set-associative, 32-byte blocks; L2 always hits
Cache latencies         Local hit = 1, remote hit = 5, local miss = 10, remote miss = 15
Algorithm               PrefClus and MinComs
Interleaving factor     2 or 4 bytes, depending on the benchmark
BASELINE                Same architecture, but with complete freedom when assigning instructions to clusters
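
To see why the share of local accesses matters with these latencies, the sketch below computes a weighted average access latency. The latency values come from the configuration above and are assumed to be in cycles. The two access mixes are hypothetical and are not measured results from the talk.

```python
LATENCY = {  # values from the configuration table, assumed to be in cycles
    "local_hit": 1,
    "remote_hit": 5,
    "local_miss": 10,
    "remote_miss": 15,
}


def average_latency(mix):
    """Weighted mean latency for a mix of access types whose fractions sum to 1."""
    assert abs(sum(mix.values()) - 1.0) < 1e-9
    return sum(LATENCY[kind] * frac for kind, frac in mix.items())


# Hypothetical access mixes, for illustration only.
before = {"local_hit": 0.60, "remote_hit": 0.30, "local_miss": 0.05, "remote_miss": 0.05}
after = {"local_hit": 0.75, "remote_hit": 0.15, "local_miss": 0.05, "remote_miss": 0.05}

avg_before = average_latency(before)
avg_after = average_latency(after)
```

In this made-up example, moving 15 points of remote hits to local hits lowers the average from 3.35 to 2.75.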
23
Local vs. Remote Accesses
24
Execution Time
25
Other Configurations
  • Configuration 1

                  Latency   Buses
Memory buses      2         4
Register buses    4         2

More pressure on register buses: MDC outperforms DDGT in all cases → MDC requires fewer register communications
27
Conclusions
  • Memory coherence problem
  • Two software-based solutions: MDC and DDGT
  • Applied to a word-interleaved cache clustered
    VLIW processor
  • MDC vs. DDGT
  • Results depend on the architecture configuration
  • MDC outperforms DDGT in most cases
  • DDGT better by up to 20% in a specific
    configuration
  • Sets of memory dependent instructions are small
  • DDGT → freedom in cluster assignment
  • Increases local accesses by 15% → reduces stall
    time

28
Questions?