Title: Token Coherence: A Framework for Implementing Multiple-CMP Systems
1. Token Coherence: A Framework for Implementing Multiple-CMP Systems
- Mike Marty¹, Jesse Bingham², Mark Hill¹, Alan Hu², Milo Martin³, and David Wood¹
- ¹University of Wisconsin-Madison
- ²University of British Columbia
- ³University of Pennsylvania
- February 17th, 2005
2. Summary
- Microprocessor → Chip Multiprocessor (CMP)
- Symmetric Multiprocessor (SMP) → Multiple CMPs
- Problem: Coherence with Multiple CMPs
- Old Solution: Hierarchical Directory (complex, slow)
- New Solution: Apply Token Coherence
  - Developed for glueless multiprocessors (ISCA 2003)
  - Keep flat for correctness
  - Exploit hierarchy for performance
  - Less complex and faster than a hierarchical directory
3. Outline
- Motivation and Background
  - Coherence in Multiple-CMP Systems
  - Example: DirectoryCMP
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
- Evaluation
4. Coherence in Multiple-CMP Systems
- Chip Multiprocessors (CMPs) are emerging
- Larger systems will be built with multiple CMPs
[Diagram: four CMPs (CMP 1-4) connected by an interconnect]
5. Problem: Hierarchical Coherence
- Intra-CMP protocol for coherence within a CMP
- Inter-CMP protocol for coherence between CMPs
- Interactions between the two protocols increase complexity (the state space explodes)
[Diagram: four CMPs on an interconnect; intra-CMP coherence runs within each chip, inter-CMP coherence runs between chips]
6. Improving Multiple-CMP Systems with Token Coherence
- Token Coherence allows Multiple-CMP systems to be...
  - Flat for correctness, but
  - Hierarchical for performance
[Diagram: a low-complexity correctness substrate spans all four CMPs, with a fast performance protocol layered on top]
7. Example: DirectoryCMP
- 2-level MOESI directory
- RACE CONDITIONS!
[Diagram: two CMPs, each with four processors (P0-P7), private L1 I/D caches, a shared L2/directory, and a memory/directory. Racing stores to block B from both chips trigger getx, inv, ack, fwd, WB, and data/ack messages while directory entries for B transition among O, S, M, and I]
8. Token Coherence Summary
- Token Coherence separates performance from correctness
- Correctness Substrate: enforces the coherence invariant and prevents starvation
  - Safety with Token Counting
  - Starvation Avoidance with Persistent Requests
- Performance Policy: makes the common case fast
  - Transient requests to seek tokens
  - Unordered, untracked, unacknowledged
  - Enables prediction, multicast, filters, etc.
9. Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
  - Safety
  - Starvation Avoidance
- Token Coherence: Hierarchical for Performance
- Evaluation
10. Example: Token Coherence (ISCA 2003)
[Diagram: four processors (P0-P3), each with private L1 I/D and L2 caches, connected to memory banks over an interconnect; P0 and P3 issue stores to B while P1 and P2 issue loads to B]
- Each memory block is initialized with T tokens
- Tokens are stored in memory, caches, and messages
- At least one token is required to read a block
- All T tokens are required to write a block
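The counting rules above can be sketched in a few lines of Python. This is an illustrative model of the invariant, not the hardware mechanism; names like `TokenHolder` and `transfer` are invented for this example.

```python
# Illustrative model of token-counting safety for one memory block.
T = 4  # total tokens for the block (sized to the number of caches)

class TokenHolder:
    """A cache or memory controller that may hold tokens for the block."""
    def __init__(self, name, tokens=0):
        self.name = name
        self.tokens = tokens

    def can_read(self):
        # At least one token is required to read the block.
        return self.tokens >= 1

    def can_write(self):
        # All T tokens are required to write the block.
        return self.tokens == T

def transfer(src, dst, n):
    """Move n tokens between holders; the global count never changes."""
    assert 0 <= n <= src.tokens, "tokens must never be lost or invented"
    src.tokens -= n
    dst.tokens += n

# Memory starts with all T tokens; P0 acquires them all to write.
mem, p0, p1 = TokenHolder("mem", T), TokenHolder("P0"), TokenHolder("P1")
transfer(mem, p0, T)
assert p0.can_write() and not p1.can_read()
# P0 hands one token to P1: both may now read, neither may write.
transfer(p0, p1, 1)
assert p0.can_read() and p1.can_read()
assert not p0.can_write() and not p1.can_write()
```

Because `transfer` conserves tokens, the "one writer or multiple readers" invariant holds no matter how tokens move, which is what makes the scheme flat for correctness.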
11. Extending to a Multiple-CMP System
[Diagram: two CMPs, each with two processors, private L1 I/D caches, an intra-chip interconnect, and a shared L2; the chips connect to memory banks over a global interconnect]
12. Extending to a Multiple-CMP System
[Diagram: P0 in CMP 0 and P2 in CMP 1 race on stores to block B]
- Token counting remains flat
- Tokens flow to caches
- Handles shared caches and other complex hierarchies
13. Safety Recap
- Safety: maintain the coherence invariant
  - Only one writer, or multiple readers
- Tokens for Safety
  - T tokens associated with each memory block
  - Token counts encoded in ⌈log₂(T+1)⌉ bits
  - A processor acquires all tokens to write, a single token to read
- Tokens were passed to nodes in the glueless-multiprocessor scheme
  - But CMPs have private and shared caches
- Tokens are passed to caches in a Multiple-CMP system
  - Arbitrary cache hierarchies are easily handled
  - Flat for correctness
14. Some Token-Counting Implications
- Memory must store tokens
  - Separate RAM, extra ECC bits, or a token cache
- T is sized to the number of caches to allow read-only copies in all caches
- Replacements cannot be silent
  - Tokens must not be lost or dropped
- Targeted at invalidate-based protocols
  - Not a solution for write-through or update protocols
- Tokens must be identified by block address
  - The address must appear in all token-carrying messages
15. Starvation Avoidance
- Request messages can miss tokens
  - In-flight tokens
  - Transient requests are not tracked throughout the system
  - Incorrect filtering, multicast, destination-set prediction, etc.
- Possible Solution: Retries
  - Retry with optional randomized backoff is effective for races
- Guaranteed Solution: Persistent Requests
  - Heavyweight request guaranteed to succeed
  - Should be rare (uses more bandwidth)
  - Locates all tokens in the system
  - Orders competing requests
16. Starvation Avoidance
[Diagram: P0, P1, and P2 race on stores to block B across two CMPs]
- Tokens move freely in the system
- Transient requests can miss in-flight tokens
  - Incorrect speculation, filters, prediction, etc.
17. Starvation Avoidance
[Diagram: the same three racing stores to block B]
- Solution: issue a Persistent Request
  - Heavyweight request guaranteed to succeed
- Methods: Centralized (ISCA 2003) and Distributed (new)
18. Old Scheme: Central Arbiter (ISCA 2003)
[Diagram: P0, P1, and P2 time out on their stores to B and send persistent requests to arbiter 0, which queues entries B:P0, B:P2, B:P1]
- Processors issue persistent requests to a central arbiter
19. Old Scheme: Central Arbiter (ISCA 2003)
[Diagram: the arbiter activates P0's request; every cache and memory controller records "B: P0" while the requests from P2 and P1 wait at the arbiter]
- Processors issue persistent requests
- The arbiter orders them and broadcasts an activate
20. Old Scheme: Central Arbiter (ISCA 2003)
[Diagram: (1) P0 sends a deactivate to the arbiter; (2) the arbiter broadcasts the deactivate along with the next activate, for P2; (3) tables system-wide switch from "B: P0" to "B: P2"]
- The processor sends a deactivate to the arbiter
- The arbiter broadcasts the deactivate (and the next activate)
- Bottom line: a handoff takes 3 message latencies
21. Improved Scheme: Distributed Arbitration (NEW)
[Diagram: every L1, shared L2, and memory controller holds a table with persistent-request entries for P0, P1, and P2, all naming block B]
- Processors broadcast persistent requests
22. Improved Scheme: Distributed Arbitration (NEW)
[Diagram: all tables hold entries for P0, P1, and P2; P0's entry is active everywhere]
- Processors broadcast persistent requests
- Fixed priority (by processor number)
23. Improved Scheme: Distributed Arbitration (NEW)
[Diagram: P0 completes its store and broadcasts a deactivate; P1's entry becomes active in every table]
- Processors broadcast persistent requests
- Fixed priority (by processor number)
- Processors broadcast deactivates
24. Improved Scheme: Distributed Arbitration (NEW)
[Diagram: only the entries for P1 and P2 remain; P1's entry is active everywhere]
- Bottom line: a handoff takes a single message latency
- Subtle point: P0 and P1 must wait until the next wave
25. Implementing Distributed Persistent Requests
- Table at each cache
  - Sized to N entries per processor (we use N=1)
  - Indexed by processor ID
  - Content-addressable by address
- Each incoming message must access the table
  - Not on the critical path, so a slow CAM suffices
- Activate/deactivate reordering cannot be allowed
  - The persistent-request virtual channel must be point-to-point ordered
  - Or use another solution, such as sequence numbers or acks
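The table logic described above can be sketched as follows. This is a hypothetical software model (class and method names are invented), not cycle-accurate hardware: one entry per processor with N=1, looked up by address, with fixed priority by processor ID so every node independently agrees on the winner.

```python
# Sketch of the per-cache persistent-request table: one entry per
# processor (N=1), searchable by block address, fixed priority by
# processor number. Every node holds an identical copy kept in sync
# by broadcast activates/deactivates.
class PersistentRequestTable:
    def __init__(self, num_procs):
        # Indexed by processor ID; each entry is an address or None.
        self.entries = [None] * num_procs

    def activate(self, proc, addr):
        """Record a broadcast persistent request from `proc` for `addr`."""
        self.entries[proc] = addr

    def deactivate(self, proc):
        """Record a broadcast deactivate from `proc`."""
        self.entries[proc] = None

    def winner(self, addr):
        """Lowest-numbered processor with an active request for `addr`
        (fixed priority), or None if no persistent request is active."""
        for proc, entry in enumerate(self.entries):
            if entry == addr:
                return proc
        return None

table = PersistentRequestTable(num_procs=4)
table.activate(0, 0xB0)   # P0, P1, P2 all broadcast requests for block B
table.activate(1, 0xB0)
table.activate(2, 0xB0)
assert table.winner(0xB0) == 0   # fixed priority: P0 wins first
table.deactivate(0)              # P0 finishes and broadcasts deactivate...
assert table.winner(0xB0) == 1   # ...and P1 wins in a single handoff
```

Because the winner is a pure function of table contents, no arbiter round-trip is needed: a deactivate broadcast alone hands tokens to the next requester, which is why the handoff is a single message latency.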
26. Implementing Distributed Persistent Requests
- Should reads be distinguished from writes?
  - Not necessary, but a persistent read request is helpful
- Implications of flat distributed arbitration
  - Simple → flat for correctness
  - Global broadcast when used
    - Fortunately, persistent requests are rare in typical workloads (0.3%)
    - A bad workload (very high contention) would burn bandwidth
  - The maximum processor count must be architected in advance
- What about a hierarchical persistent-request scheme?
  - Possible, but correctness would no longer be flat
  - Make the common case fast
27. Reducing Unnecessary Traffic
- Problem: which token-holding cache responds with data?
- Solution: distinguish one token as the owner token
  - The owner includes data with its token response
- A clean vs. dirty owner distinction is also useful for writebacks
28. Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
  - TokenCMP
  - Another look at performance policies
- Evaluation
29. Hierarchical for Performance: TokenCMP
- Target System
  - 2-8 CMPs
  - Private L1s and a shared L2 per CMP
  - Any interconnect, but high-bandwidth
- Performance Policy Goals
  - Aggressively acquire tokens
  - Exploit on-chip locality and bandwidth
  - Respect the cache hierarchy
  - Detect and handle missed tokens
30. Hierarchical for Performance: TokenCMP
- Approach
  - On an L1 miss, broadcast within the local CMP
    - A local cache responds if possible
  - On an L2 miss, broadcast to the other CMPs
    - The appropriate L2 bank responds or broadcasts within its CMP (optionally filtered)
  - Responses between CMPs carry extra tokens for future locality
- Handling missed tokens
  - Timeout after the average memory latency
  - Invoke a persistent request (no retries)
- Larger systems can use filters, multicast, or soft-state directories
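The missed-token handling above amounts to a small state machine: wait until the request has outstood the average memory latency, then escalate straight to a persistent request rather than retrying. A sketch, with an illustrative (not measured) latency value and invented function names:

```python
# Sketch of TokenCMP's miss-handling policy: no retries; a transient
# request that outlives the average memory latency escalates directly
# to a persistent request. The latency constant is illustrative.
AVG_MEMORY_LATENCY = 200  # cycles; the timeout threshold

def next_action(cycles_waiting, tokens_acquired, tokens_needed):
    """Decide what a requesting processor should do next."""
    if tokens_acquired >= tokens_needed:
        return "complete"
    if cycles_waiting > AVG_MEMORY_LATENCY:
        # The transient request missed tokens (in-flight, filtered,
        # or mispredicted): escalate to a persistent request.
        return "persistent_request"
    return "wait"

assert next_action(50, 1, 4) == "wait"
assert next_action(250, 1, 4) == "persistent_request"
assert next_action(250, 4, 4) == "complete"
```

Skipping retries keeps the policy simple: the common case resolves before the timeout, and the rare miss pays one escalation rather than an open-ended retry loop.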
31. Other Optimizations in TokenCMP
- Implementing the E-state
  - Memory responds with all tokens on a read request
  - The clean/dirty owner distinction eliminates writing back unwritten data
- Implementing Migratory Sharing
  - What is it? A processor's read request results in exclusive permission if the responder has exclusive permission and wrote the block
  - In TokenCMP, simply return all tokens
- Non-speculative delay
  - Hold a block for some cycles so permission isn't stolen prematurely
32. Another Look at Performance Policies
- How to find tokens?
  - Broadcast
  - Broadcast with filters
  - Multicast (destination-set prediction)
  - Directories (soft or hard)
- Who responds with data?
  - The owner-token holder
    - TokenCMP uses the owner token for inter-CMP responses
  - Other heuristics
    - For TokenCMP intra-CMP responses, a cache responds if it has extra tokens
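The two response heuristics above can be captured in one small decision function. This is an assumed formalization for illustration (the function name and signature are invented): across chips only the owner-token holder sends data, so exactly one data copy crosses the inter-CMP interconnect; within a chip, any cache with tokens to spare may respond.

```python
# Sketch of TokenCMP's "who responds with data?" heuristics.
def responds_with_data(has_owner_token, tokens, inter_cmp):
    """Should this token-holding cache answer a request with data?"""
    if inter_cmp:
        # Inter-CMP: only the owner-token holder sends data, ensuring
        # a single data response crosses the global interconnect.
        return has_owner_token
    # Intra-CMP: respond only with tokens to spare, i.e. more than the
    # single token the cache needs to keep its own read permission.
    return tokens > 1

assert responds_with_data(True, 3, inter_cmp=True)        # owner answers
assert not responds_with_data(False, 3, inter_cmp=True)   # non-owners stay quiet
assert responds_with_data(False, 2, inter_cmp=False)      # has a spare token
assert not responds_with_data(False, 1, inter_cmp=False)  # would lose read permission
```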
33. Transient Requests May Reduce Complexity
- The processor holds the only required state about its request
- TokenCMP's L2 controller is very simple
  - Re-broadcasts an L1 request message on a miss
  - Re-broadcasts or filters external request messages
  - Possible states: no tokens (I), some tokens (S), all tokens (M)
  - Bounces unexpected tokens to memory
- DirectoryCMP's L2 controller is complex
  - Allocates an MSHR on miss and forward
  - Issues invalidates and collects acks
  - Orders all intra-CMP requests and writebacks
  - 57 states in our L2 implementation!
34. Writebacks
- DirectoryCMP uses 3-phase writebacks
  - The L1 issues a writeback request
  - The L2 enters a transient state or blocks the request
  - The L2 responds with a writeback ack
  - The L1 sends the data
- TokenCMP uses fire-and-forget writebacks
  - Immediately send tokens and data
  - Heuristic: only send data if tokens > 1
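A fire-and-forget writeback is a single message with no handshake; the heuristic above decides whether the message carries data at all. A minimal sketch (the function name and message format are invented for illustration):

```python
# Sketch of a TokenCMP fire-and-forget writeback: one message, no
# request/ack handshake. Heuristic from the talk: include data only
# when the cache holds more than one token.
def writeback_message(tokens, data):
    """Build the single writeback message a cache fires and forgets."""
    assert tokens >= 1, "a writeback must carry at least one token"
    return {"tokens": tokens, "data": data if tokens > 1 else None}

msg = writeback_message(tokens=3, data=b"block B")
assert msg["data"] == b"block B"       # multiple tokens: data included
msg = writeback_message(tokens=1, data=b"block B")
assert msg["data"] is None             # single token: send tokens only
```

Contrast with DirectoryCMP's 3-phase scheme, which needs two extra messages and a transient L2 state before the data ever moves; token conservation makes the ack unnecessary.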
35. Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
- Evaluation
  - Model checking
  - Performance with commercial workloads
  - Robustness
36. TokenCMP Evaluation
- Simple?
  - Some anecdotal examples and comparisons
  - Model checking
- Fast?
  - Full-system simulation with commercial workloads
- Robust?
  - Micro-benchmarks to simulate high contention
37. Complexity Evaluation with Model Checking
- This work was performed by Jesse Bingham and Alan Hu of the University of British Columbia
- Methods
  - TLA+ and the TLC model checker
  - The DirectoryCMP model omits all intra-CMP details
  - TokenCMP's correctness substrate fully modeled
- Results
  - Model complexity is similar between TokenCMP and non-hierarchical DirectoryCMP
  - The correctness substrate is verified to be coherent and deadlock-free
  - All possible performance policies are correct
38. Performance Evaluation
- Target System
  - 4 CMPs, 4 processors per CMP
  - 2 GHz OoO SPARC cores, 8 MB shared L2 per chip
  - Directly connected interconnect
- Methods: Multifacet GEMS simulator
  - Simics augmented with timing models
  - Released soon: http://www.cs.wisc.edu/gems
- Benchmarks
  - Performance: Apache, SPEC, OLTP
  - Robustness: locking micro-benchmark
39. Full-System Simulation: Runtime
- TokenCMP performs 9-50% faster than DirectoryCMP
40. Full-System Simulation: Runtime
- TokenCMP performs 9-50% faster than DirectoryCMP
[Chart: runtimes shown against a DRAM-directory baseline and a perfect-L2 bound]
41. Full-System Simulation: Inter-CMP Traffic
- TokenCMP traffic is reasonable (or better)
- DirectoryCMP's control overhead exceeds broadcast cost for a small system
42. Full-System Simulation: Intra-CMP Traffic
43. Performance Robustness
[Chart: locking micro-benchmark, correctness substrate only; x-axis runs from less contention to more contention]
44. Performance Robustness
[Chart: next build of the same locking micro-benchmark, correctness substrate only]
45. Performance Robustness
[Chart: locking micro-benchmark; x-axis runs from less contention to more contention]
46. Summary
- Microprocessor → Chip Multiprocessor (CMP)
- Symmetric Multiprocessor (SMP) → Multiple CMPs
- Problem: Coherence with Multiple CMPs
- Old Solution: Hierarchical Directory (complex, slow)
- New Solution: Apply Token Coherence
  - Developed for glueless multiprocessors (ISCA 2003)
  - Keep flat for correctness
  - Exploit hierarchy for performance
  - Less complex and faster than a hierarchical directory