An Interleaved Cache Clustered VLIW Processor

1
An Interleaved Cache Clustered VLIW Processor
  • E. Gibert, J. Sánchez* and A. González*
  • Dept. d'Arquitectura de Computadors
  • Universitat Politècnica de Catalunya (UPC)
  • * Also at Intel Barcelona Research Center
  • June 2002

2
Motivation
  • Capacity-bound vs. communication-bound
  • Solution: clustered microarchitectures
  • Partition some hardware resources
  • Simpler, faster
  • Power consumption
  • Communications not homogeneous
  • Goal: clustering the memory hierarchy in
    statically scheduled processors

3
Talk Outline
  • State-of-the-art: multiVLIW
  • Interleaved Cache Clustered VLIW
  • Scheduling Algorithms
  • Enhancement: Attraction Buffers
  • Experimental Framework
  • Results
  • Conclusions

4
State-of-the-art: MultiVLIW
  • Sánchez and González, MICRO'00

6
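Basic Interleaved Cache Clustered VLIW Processor
[Figure: four clusters (CLUSTER 1 to CLUSTER 4), each with its own FUs, register file (Reg. File), and cache module. The clusters are linked by register-to-register buses, and the cache modules connect through memory buses to the NEXT MEMORY LEVEL. A cache block is interleaved across the modules; the module shown holds a TAG and words W0 and W4.]
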
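
The word interleaving in the figure can be sketched in a few lines of Python. This is only an illustration under assumptions: the mapping `(addr // I) % N` and the function names `home_module` and `words_in_module` are hypothetical and not taken from the slides; N = 4 clusters, I = 4 bytes, and 32-byte lines are the values given in the Experimental Framework slide.

```python
N_CLUSTERS = 4        # N: number of clusters / cache modules (from the slides)
INTERLEAVE_BYTES = 4  # I: interleaving factor in bytes (from the slides)
LINE_BYTES = 32       # cache line size (from the slides)


def home_module(addr, n=N_CLUSTERS, i=INTERLEAVE_BYTES):
    """Return the 0-based module assumed to hold byte address `addr`.

    Assumption: consecutive I-byte words rotate round-robin over the modules.
    """
    return (addr // i) % n


def words_in_module(module, line_bytes=LINE_BYTES, n=N_CLUSTERS, i=INTERLEAVE_BYTES):
    """List the word indices of one cache line that live in `module`."""
    words_per_line = line_bytes // i
    return [w for w in range(words_per_line) if w % n == module]


if __name__ == "__main__":
    for m in range(N_CLUSTERS):
        print("module", m, "holds words", words_in_module(m))
```

Under these assumptions the first module holds words 0 and 4 of each line, which agrees with the W0 and W4 labels that survive from the figure.
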
8
Modulo Scheduling
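  • Extract ILP from loops → overlap execution of
    iterations

[Figure: loop L with operations A, B, and C. A new iteration starts every II (initiation interval) cycles, so A, B, and C from successive iterations overlap. The steady-state portion, where C, B, and A from different iterations run together, is labelled Kernel. Other label: SC (stage count).]
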
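
The overlap shown in the figure can be reproduced with a small sketch. It assumes that each of the stages A, B, and C occupies exactly one II-long time slot and that iteration k starts in slot k; the function names `pipeline_rows` and `kernel_slots` are hypothetical and only serve the illustration.

```python
def pipeline_rows(stages, iterations):
    """Return one row per time slot; column k shows the stage of iteration k.

    Assumption: iteration k starts in slot k and each stage lasts one slot
    (one initiation interval). Idle positions are shown as ".".
    """
    rows = []
    for slot in range(iterations + len(stages) - 1):
        row = []
        for it in range(iterations):
            s = slot - it
            row.append(stages[s] if 0 <= s < len(stages) else ".")
        rows.append(row)
    return rows


def kernel_slots(stages, iterations):
    """Slots in which every stage is active at once (the steady-state kernel)."""
    sc = len(stages)  # stage count
    return [t for t in range(iterations + sc - 1) if sc - 1 <= t <= iterations - 1]


if __name__ == "__main__":
    for t, row in enumerate(pipeline_rows("ABC", 4)):
        print(t, " ".join(row))
    print("kernel slots:", kernel_slots("ABC", 4))
```

The rows before the kernel slots correspond to the prologue and the rows after them to the epilogue of the software-pipelined loop.
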
9
Base Scheduling Algorithm
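  • Used for Unified Cache

[Flowchart: START → sort nodes → next node → select possible clusters → how many? If 0: II = II + 1. If >0: pick the cluster(s) with the best profit in output edges → how many? If 1: schedule it. If >1: pick the least loaded cluster, then schedule it.]
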
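
One possible reading of the cluster-selection step in the flowchart is sketched below. The function name `choose_cluster`, its arguments, and the tie-breaking on cluster id are assumptions made for illustration; the slides only name the criteria (possible clusters, best profit in output edges, least loaded).

```python
def choose_cluster(feasible, profit, load):
    """Pick a cluster for the next node, or None if none is feasible.

    feasible: cluster ids where the node can be placed at the current II
    profit:   dict cluster -> profit in output edges (higher is better)
    load:     dict cluster -> operations already scheduled there

    Returning None signals the caller to retry with II = II + 1.
    """
    if not feasible:
        return None
    best = max(profit[c] for c in feasible)
    top = [c for c in feasible if profit[c] == best]
    if len(top) == 1:
        return top[0]
    # Several clusters tie on profit: fall back to the least loaded one
    # (ties on load are broken by cluster id, an arbitrary choice here).
    return min(top, key=lambda c: (load[c], c))


if __name__ == "__main__":
    print(choose_cluster([0, 1, 2], {0: 2, 1: 2, 2: 1}, {0: 5, 1: 3, 2: 0}))
```
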
10
Interleaved Cache Scheduling Algorithm
  • Unroll loop to maximize instructions with a
    stride multiple of N×I → access ONE cache module
  • Assign latencies to memory instructions
  • Assign memory instructions to clusters
  • IPBC (Interleaved Pre-Build Chains)
    → minimize stall time
  • IBC (Interleaved Build Chains)
    → minimize compute time
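
The unrolling step can be illustrated with a short sketch. The formula below (smallest factor U such that U × stride is a multiple of N×I) is an assumption chosen for illustration and is not stated on the slides; the names `unroll_factor` and `modules_touched`, and the module mapping `(addr // I) % N`, are likewise hypothetical.

```python
from math import gcd


def unroll_factor(stride_bytes, n_clusters=4, interleave_bytes=4):
    """Smallest unroll factor U such that U * stride is a multiple of N*I."""
    span = n_clusters * interleave_bytes
    return span // gcd(span, abs(stride_bytes))


def modules_touched(base, stride_bytes, unroll, iterations, n=4, i=4):
    """For each unrolled copy of a load, the set of modules it accesses.

    Assumes module = (addr // I) % N. A singleton set means the copy always
    accesses ONE cache module, so it has a clear preferred cluster.
    """
    return [
        {((base + (it * unroll + k) * stride_bytes) // i) % n
         for it in range(iterations)}
        for k in range(unroll)
    ]


if __name__ == "__main__":
    u = unroll_factor(4)  # 4-byte elements, N = 4, I = 4
    print("unroll factor:", u)
    print("modules per copy:", modules_touched(0, 4, u, 8))
    print("without unrolling:", modules_touched(0, 4, 1, 8))
```

With 4-byte elements, N = 4, and I = 4, this yields an unroll factor of 4 and a 16-byte stride per unrolled load, in line with the example slide that follows.
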

11
Memory Dependent Instructions
IPBC → preferred info is used vs. IBC → minimize
register comms.
[Figure: a dependence graph with load, add, and store nodes annotated Preferred = 1 or Preferred = 2 and grouped into memory dependent chain 1 and memory dependent chain 2.]

13
Enhancement: Attraction Buffers
[Figure: an ADDRESS is presented to a CACHE MODULE that contains Local Data and an Attraction Buffer (ABuffer); local logic combines the data and hit signals from each.]

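
A toy model can make the role of the Attraction Buffer concrete. Everything below is an assumption made for illustration: the class name `ClusterCacheModel`, the FIFO replacement, the buffer size, the subblock granularity (the words of one line that share a home module), and the simplification that every access hits somewhere in the cache. The slides only state that the buffer reduces remote hits to previously accessed subblocks.

```python
from collections import OrderedDict


class ClusterCacheModel:
    """One cluster's view of the interleaved cache, with an Attraction Buffer."""

    def __init__(self, cluster, n=4, interleave=4, line_bytes=32, entries=8):
        self.cluster = cluster          # 0-based id of this cluster
        self.n = n                      # number of clusters / modules
        self.interleave = interleave    # interleaving factor in bytes
        self.line_bytes = line_bytes    # cache line size in bytes
        self.entries = entries          # hypothetical buffer capacity
        self.abuffer = OrderedDict()    # attracted subblocks, FIFO order

    def access(self, addr):
        """Classify an access as 'local hit', 'abuffer hit', or 'remote hit'."""
        home = (addr // self.interleave) % self.n
        if home == self.cluster:
            return "local hit"
        key = (addr // self.line_bytes, home)  # subblock = (line, home module)
        if key in self.abuffer:
            return "abuffer hit"
        if len(self.abuffer) >= self.entries:
            self.abuffer.popitem(last=False)   # evict the oldest subblock
        self.abuffer[key] = True               # attract the remote subblock
        return "remote hit"


if __name__ == "__main__":
    c = ClusterCacheModel(cluster=3)
    for addr in (12, 0, 16, 32):
        print(addr, c.access(addr))
```

In this model the first access to a remote subblock pays the remote latency, and later accesses to the same subblock are served from the local buffer.
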
14
An Example
Original loop:
for (i=0; i<MAX; i++)
  ld r3, a[i]
  r4 = OP(r3)
  st r4, b[i]

Unroll x4:
for (i=0; i<MAX; i+=4)
  ld r31, a[i]      (stride 16)
  ld r32, a[i+1]
  ld r33, a[i+2]
  ld r34, a[i+3]
  r41 = OP(r31)
  r42 = OP(r32)
  r43 = OP(r33)
  r44 = OP(r34)
  st r41, b[i]
  st r42, b[i+1]
  st r43, b[i+2]
  st r44, b[i+3]

  • 16-byte strides (N×I multiple)
  • N = 4 clusters, I = 4 bytes

[Figure: array elements a[0], a[1], a[2], a[3], ...; CLUSTER 4, with its local module and ABuffer, executes ld r31, a[0].]

15
Enhancement: Attraction Buffers
  • Why remote accesses? Why Attraction Buffers?
  • Double precision accesses → low benefit
  • Indirect accesses a[b[i]] → low benefit
  • Unclear preferred cluster → big benefit
      for (i=0; i<MAX; i++)
        for (k=i; k<i+MAX; k+=4)
          ld a[k], ld a[k+1], ld a[k+2], ld a[k+3]
  • Memory dependent chains → big benefit
  • IBC: preferred cluster info is not used → big
    benefit

17
Experimental Framework
  • IMPACT C compiler
  • Modulo scheduling on hyperblock loops
  • BASE for a Unified Cache
  • IPBC and IBC for an Interleaved Cache
  • IPBC and IBC for the MultiVLIW
  • The same unrolling factor has been used for all
    architecture configurations!
  • Mediabench benchmark suite

18
Experimental Framework
Number of clusters: 4
Functional units: 1 FP / cluster, 1 int / cluster, 1 mem / cluster
Cache configuration: 8KB, 32-byte lines, 2-way set associative, 1 cycle latency
Reg-to-reg communication buses: 4 buses that run at ½ the core frequency
Memory buses: 4 buses that run at ½ (or ¼) the core frequency
Next memory level: 4 ports, 5 cycle latency, always hit
Interleaving factor (Interleaved Cache): 4 bytes
Latencies: 1-10 (Unified Cache, MultiVLIW); 1-(5/6)-10-15 (Interleaved Cache)

19
Results (I)
  • IPBC vs. IBC → similar cycle count results
  • MultiVLIW vs. Interleaved → similar results BUT
    lower complexity!

20
Results (II)
  • Memory dependent chains
  • Interleaved cache → workload imbalance → remote
    accesses
  • MultiVLIW → workload imbalance
  • Working on techniques to overcome scheduling
    restrictions

21
Results (III)
  • Local hits are increased by 15%
  • Stall time is reduced by 30%

22
Conclusions
  • Scheduling Algorithms
  • Good latency assignment process (stall time
    accounts for 9% of execution time)
  • Coherence kept through memory dependent chains
    (5% cycle count degradation)
  • Attraction Buffers
  • Effective to increase local hits (15% average)
    and reduce stall time (30% average)
  • Reduce remote hits to previously accessed
    subblocks (70% average)
  • Cycle count results
  • Similar to Unified Cache and MultiVLIW

23
Questions