Title: Token Coherence: A Framework for Implementing Multiple-CMP Systems
1. Token Coherence: A Framework for Implementing Multiple-CMP Systems
- Mike Marty¹, Jesse Bingham², Mark Hill¹, Alan Hu², Milo Martin³, and David Wood¹
- ¹University of Wisconsin-Madison
- ²University of British Columbia
- ³University of Pennsylvania
- February 17th, 2005
2. Summary
- Microprocessor → Chip Multiprocessor (CMP)
- Symmetric Multiprocessor (SMP) → Multiple CMPs
- Problem: Coherence with Multiple CMPs
- Old Solution: Hierarchical Directory (complex, slow)
- New Solution: Apply Token Coherence
  - Developed for glueless multiprocessors (ISCA 2003)
  - Keep flat for correctness
  - Exploit hierarchy for performance
  - Less complex and faster than a hierarchical directory
3. Outline
- Motivation and Background
  - Coherence in Multiple-CMP Systems
  - Example: DirectoryCMP
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
- Evaluation
4. Coherence in Multiple-CMP Systems
- Chip Multiprocessors (CMPs) are emerging
- Larger systems will be built with multiple CMPs
[Diagram: four CMPs (CMP 1-4) connected by an interconnect]
5. Problem: Hierarchical Coherence
- Intra-CMP protocol for coherence within a CMP
- Inter-CMP protocol for coherence between CMPs
- Interactions between the two protocols increase complexity (the state space explodes)
[Diagram: four CMPs on an interconnect; intra-CMP coherence runs within each chip, inter-CMP coherence runs between chips]
6. Improving Multiple-CMP Systems with Token Coherence
- Token Coherence allows Multiple-CMP systems to be...
  - Flat for correctness, but
  - Hierarchical for performance
[Diagram: a low-complexity correctness substrate spans all four CMPs, with a fast performance protocol layered on top]
7. Example: DirectoryCMP
- 2-level MOESI directory
- RACE CONDITIONS!
[Diagram: two CMPs, each with four processors (P0-P7), private L1 I/D caches, a shared L2/directory, and a memory/directory. Racing stores to block B from both chips trigger getx, inv, ack, fwd, WB, and data/ack messages while directory entries for B transition among O, S, M, and I]
8. Token Coherence Summary
- Token Coherence separates performance from correctness
- Correctness Substrate: enforces the coherence invariant and prevents starvation
  - Safety with Token Counting
  - Starvation Avoidance with Persistent Requests
- Performance Policy: makes the common case fast
  - Transient requests to seek tokens
  - Unordered, untracked, unacknowledged
  - Enables prediction, multicast, filters, etc.
9. Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
  - Safety
  - Starvation Avoidance
- Token Coherence: Hierarchical for Performance
- Evaluation
10. Example: Token Coherence (ISCA 2003)
[Diagram: four processors (P0-P3), each with private L1 I/D and L2 caches, connected to memory banks over an interconnect; P0 and P3 issue stores to B while P1 and P2 issue loads to B]
- Each memory block is initialized with T tokens
- Tokens are stored in memory, caches, and messages
- At least one token is required to read a block
- All T tokens are required to write a block
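The counting rules above can be sketched in a few lines of Python. This is an illustrative model of the invariant, not the hardware mechanism; names like `TokenHolder` and `transfer` are invented for this example.

```python
# Illustrative model of token-counting safety for one memory block.
T = 4  # total tokens for the block (sized to the number of caches)

class TokenHolder:
    """A cache or memory controller that may hold tokens for the block."""
    def __init__(self, name, tokens=0):
        self.name = name
        self.tokens = tokens

    def can_read(self):
        # At least one token is required to read the block.
        return self.tokens >= 1

    def can_write(self):
        # All T tokens are required to write the block.
        return self.tokens == T

def transfer(src, dst, n):
    """Move n tokens between holders; the global count never changes."""
    assert 0 <= n <= src.tokens, "tokens must never be lost or invented"
    src.tokens -= n
    dst.tokens += n

# Memory starts with all T tokens; P0 acquires them all to write.
mem, p0, p1 = TokenHolder("mem", T), TokenHolder("P0"), TokenHolder("P1")
transfer(mem, p0, T)
assert p0.can_write() and not p1.can_read()
# P0 hands one token to P1: both may now read, neither may write.
transfer(p0, p1, 1)
assert p0.can_read() and p1.can_read()
assert not p0.can_write() and not p1.can_write()
```

Because `transfer` conserves tokens, the "one writer or multiple readers" invariant holds no matter how tokens move, which is what makes the scheme flat for correctness.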
11. Extending to a Multiple-CMP System
[Diagram: two CMPs, each with two processors, private L1 I/D caches, an intra-chip interconnect, and a shared L2; the chips connect to memory banks over a global interconnect]
12. Extending to a Multiple-CMP System
[Diagram: P0 in CMP 0 and P2 in CMP 1 race on stores to block B]
- Token counting remains flat
- Tokens flow to caches
- Handles shared caches and other complex hierarchies
13. Safety Recap
- Safety: maintain the coherence invariant
  - Only one writer, or multiple readers
- Tokens for Safety
  - T tokens associated with each memory block
  - Token counts encoded in ⌈log₂(T+1)⌉ bits
  - A processor acquires all tokens to write, a single token to read
- Tokens were passed to nodes in the glueless-multiprocessor scheme
  - But CMPs have private and shared caches
- Tokens are passed to caches in a Multiple-CMP system
  - Arbitrary cache hierarchies are easily handled
  - Flat for correctness
14. Some Token-Counting Implications
- Memory must store tokens
  - Separate RAM, extra ECC bits, or a token cache
- T is sized to the number of caches to allow read-only copies in all caches
- Replacements cannot be silent
  - Tokens must not be lost or dropped
- Targeted at invalidate-based protocols
  - Not a solution for write-through or update protocols
- Tokens must be identified by block address
  - The address must appear in all token-carrying messages
15. Starvation Avoidance
- Request messages can miss tokens
  - In-flight tokens
  - Transient requests are not tracked throughout the system
  - Incorrect filtering, multicast, destination-set prediction, etc.
- Possible Solution: Retries
  - Retry with optional randomized backoff is effective for races
- Guaranteed Solution: Persistent Requests
  - Heavyweight request guaranteed to succeed
  - Should be rare (uses more bandwidth)
  - Locates all tokens in the system
  - Orders competing requests
16. Starvation Avoidance
[Diagram: P0, P1, and P2 race on stores to block B across two CMPs]
- Tokens move freely in the system
- Transient requests can miss in-flight tokens
  - Incorrect speculation, filters, prediction, etc.
17. Starvation Avoidance
[Diagram: the same three racing stores to block B]
- Solution: issue a Persistent Request
  - Heavyweight request guaranteed to succeed
- Methods: Centralized (ISCA 2003) and Distributed (new)
18. Old Scheme: Central Arbiter (ISCA 2003)
[Diagram: P0, P1, and P2 time out on their stores to B and send persistent requests to arbiter 0, which queues entries B:P0, B:P2, B:P1]
- Processors issue persistent requests to a central arbiter
19. Old Scheme: Central Arbiter (ISCA 2003)
[Diagram: the arbiter activates P0's request; every cache and memory controller records "B: P0" while the requests from P2 and P1 wait at the arbiter]
- Processors issue persistent requests
- The arbiter orders them and broadcasts an activate
20. Old Scheme: Central Arbiter (ISCA 2003)
[Diagram: (1) P0 sends a deactivate to the arbiter; (2) the arbiter broadcasts the deactivate along with the next activate, for P2; (3) tables system-wide switch from "B: P0" to "B: P2"]
- The processor sends a deactivate to the arbiter
- The arbiter broadcasts the deactivate (and the next activate)
- Bottom line: a handoff takes 3 message latencies
21. Improved Scheme: Distributed Arbitration (NEW)
[Diagram: every L1, shared L2, and memory controller holds a table with persistent-request entries for P0, P1, and P2, all naming block B]
- Processors broadcast persistent requests
22. Improved Scheme: Distributed Arbitration (NEW)
[Diagram: all tables hold entries for P0, P1, and P2; P0's entry is active everywhere]
- Processors broadcast persistent requests
- Fixed priority (by processor number)
23. Improved Scheme: Distributed Arbitration (NEW)
[Diagram: P0 completes its store and broadcasts a deactivate; P1's entry becomes active in every table]
- Processors broadcast persistent requests
- Fixed priority (by processor number)
- Processors broadcast deactivates
24. Improved Scheme: Distributed Arbitration (NEW)
[Diagram: only the entries for P1 and P2 remain; P1's entry is active everywhere]
- Bottom line: a handoff takes a single message latency
- Subtle point: P0 and P1 must wait until the next wave
25. Implementing Distributed Persistent Requests
- Table at each cache
  - Sized to N entries per processor (we use N=1)
  - Indexed by processor ID
  - Content-addressable by address
- Each incoming message must access the table
  - Not on the critical path, so a slow CAM suffices
- Activate/deactivate reordering cannot be allowed
  - The persistent-request virtual channel must be point-to-point ordered
  - Or use another solution, such as sequence numbers or acks
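The table logic described above can be sketched as follows. This is a hypothetical software model (class and method names are invented), not cycle-accurate hardware: one entry per processor with N=1, looked up by address, with fixed priority by processor ID so every node independently agrees on the winner.

```python
# Sketch of the per-cache persistent-request table: one entry per
# processor (N=1), searchable by block address, fixed priority by
# processor number. Every node holds an identical copy kept in sync
# by broadcast activates/deactivates.
class PersistentRequestTable:
    def __init__(self, num_procs):
        # Indexed by processor ID; each entry is an address or None.
        self.entries = [None] * num_procs

    def activate(self, proc, addr):
        """Record a broadcast persistent request from `proc` for `addr`."""
        self.entries[proc] = addr

    def deactivate(self, proc):
        """Record a broadcast deactivate from `proc`."""
        self.entries[proc] = None

    def winner(self, addr):
        """Lowest-numbered processor with an active request for `addr`
        (fixed priority), or None if no persistent request is active."""
        for proc, entry in enumerate(self.entries):
            if entry == addr:
                return proc
        return None

table = PersistentRequestTable(num_procs=4)
table.activate(0, 0xB0)   # P0, P1, P2 all broadcast requests for block B
table.activate(1, 0xB0)
table.activate(2, 0xB0)
assert table.winner(0xB0) == 0   # fixed priority: P0 wins first
table.deactivate(0)              # P0 finishes and broadcasts deactivate...
assert table.winner(0xB0) == 1   # ...and P1 wins in a single handoff
```

Because the winner is a pure function of table contents, no arbiter round-trip is needed: a deactivate broadcast alone hands tokens to the next requester, which is why the handoff is a single message latency.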
26. Implementing Distributed Persistent Requests
- Should reads be distinguished from writes?
  - Not necessary, but a persistent read request is helpful
- Implications of flat distributed arbitration
  - Simple → flat for correctness
  - Global broadcast when used
    - Fortunately, persistent requests are rare in typical workloads (0.3%)
    - A bad workload (very high contention) would burn bandwidth
  - The maximum processor count must be architected in advance
- What about a hierarchical persistent-request scheme?
  - Possible, but correctness would no longer be flat
  - Make the common case fast
27. Reducing Unnecessary Traffic
- Problem: which token-holding cache responds with data?
- Solution: distinguish one token as the owner token
  - The owner includes data with its token response
- A clean vs. dirty owner distinction is also useful for writebacks
28. Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
  - TokenCMP
  - Another look at performance policies
- Evaluation
29. Hierarchical for Performance: TokenCMP
- Target System
  - 2-8 CMPs
  - Private L1s and a shared L2 per CMP
  - Any interconnect, but high-bandwidth
- Performance Policy Goals
  - Aggressively acquire tokens
  - Exploit on-chip locality and bandwidth
  - Respect the cache hierarchy
  - Detect and handle missed tokens
30. Hierarchical for Performance: TokenCMP
- Approach
  - On an L1 miss, broadcast within the local CMP
    - A local cache responds if possible
  - On an L2 miss, broadcast to the other CMPs
    - The appropriate L2 bank responds or broadcasts within its CMP (optionally filtered)
  - Responses between CMPs carry extra tokens for future locality
- Handling missed tokens
  - Timeout after the average memory latency
  - Invoke a persistent request (no retries)
- Larger systems can use filters, multicast, or soft-state directories
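The missed-token handling above amounts to a small state machine: wait until the request has outstood the average memory latency, then escalate straight to a persistent request rather than retrying. A sketch, with an illustrative (not measured) latency value and invented function names:

```python
# Sketch of TokenCMP's miss-handling policy: no retries; a transient
# request that outlives the average memory latency escalates directly
# to a persistent request. The latency constant is illustrative.
AVG_MEMORY_LATENCY = 200  # cycles; the timeout threshold

def next_action(cycles_waiting, tokens_acquired, tokens_needed):
    """Decide what a requesting processor should do next."""
    if tokens_acquired >= tokens_needed:
        return "complete"
    if cycles_waiting > AVG_MEMORY_LATENCY:
        # The transient request missed tokens (in-flight, filtered,
        # or mispredicted): escalate to a persistent request.
        return "persistent_request"
    return "wait"

assert next_action(50, 1, 4) == "wait"
assert next_action(250, 1, 4) == "persistent_request"
assert next_action(250, 4, 4) == "complete"
```

Skipping retries keeps the policy simple: the common case resolves before the timeout, and the rare miss pays one escalation rather than an open-ended retry loop.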
31. Other Optimizations in TokenCMP
- Implementing the E-state
  - Memory responds with all tokens on a read request
  - The clean/dirty owner distinction eliminates writing back unwritten data
- Implementing Migratory Sharing
  - What is it? A processor's read request results in exclusive permission if the responder has exclusive permission and wrote the block
  - In TokenCMP, simply return all tokens
- Non-speculative delay
  - Hold a block for some cycles so permission isn't stolen prematurely
32. Another Look at Performance Policies
- How to find tokens?
  - Broadcast
  - Broadcast with filters
  - Multicast (destination-set prediction)
  - Directories (soft or hard)
- Who responds with data?
  - The owner-token holder
    - TokenCMP uses the owner token for inter-CMP responses
  - Other heuristics
    - For TokenCMP intra-CMP responses, a cache responds if it has extra tokens
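The two response heuristics above can be captured in one small decision function. This is an assumed formalization for illustration (the function name and signature are invented): across chips only the owner-token holder sends data, so exactly one data copy crosses the inter-CMP interconnect; within a chip, any cache with tokens to spare may respond.

```python
# Sketch of TokenCMP's "who responds with data?" heuristics.
def responds_with_data(has_owner_token, tokens, inter_cmp):
    """Should this token-holding cache answer a request with data?"""
    if inter_cmp:
        # Inter-CMP: only the owner-token holder sends data, ensuring
        # a single data response crosses the global interconnect.
        return has_owner_token
    # Intra-CMP: respond only with tokens to spare, i.e. more than the
    # single token the cache needs to keep its own read permission.
    return tokens > 1

assert responds_with_data(True, 3, inter_cmp=True)        # owner answers
assert not responds_with_data(False, 3, inter_cmp=True)   # non-owners stay quiet
assert responds_with_data(False, 2, inter_cmp=False)      # has a spare token
assert not responds_with_data(False, 1, inter_cmp=False)  # would lose read permission
```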
33. Transient Requests May Reduce Complexity
- The processor holds the only required state about its request
- TokenCMP's L2 controller is very simple
  - Re-broadcasts an L1 request message on a miss
  - Re-broadcasts or filters external request messages
  - Possible states: no tokens (I), some tokens (S), all tokens (M)
  - Bounces unexpected tokens to memory
- DirectoryCMP's L2 controller is complex
  - Allocates an MSHR on miss and forward
  - Issues invalidates and collects acks
  - Orders all intra-CMP requests and writebacks
  - 57 states in our L2 implementation!
34. Writebacks
- DirectoryCMP uses 3-phase writebacks
  - The L1 issues a writeback request
  - The L2 enters a transient state or blocks the request
  - The L2 responds with a writeback ack
  - The L1 sends the data
- TokenCMP uses fire-and-forget writebacks
  - Immediately send tokens and data
  - Heuristic: only send data if tokens > 1
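A fire-and-forget writeback is a single message with no handshake; the heuristic above decides whether the message carries data at all. A minimal sketch (the function name and message format are invented for illustration):

```python
# Sketch of a TokenCMP fire-and-forget writeback: one message, no
# request/ack handshake. Heuristic from the talk: include data only
# when the cache holds more than one token.
def writeback_message(tokens, data):
    """Build the single writeback message a cache fires and forgets."""
    assert tokens >= 1, "a writeback must carry at least one token"
    return {"tokens": tokens, "data": data if tokens > 1 else None}

msg = writeback_message(tokens=3, data=b"block B")
assert msg["data"] == b"block B"       # multiple tokens: data included
msg = writeback_message(tokens=1, data=b"block B")
assert msg["data"] is None             # single token: send tokens only
```

Contrast with DirectoryCMP's 3-phase scheme, which needs two extra messages and a transient L2 state before the data ever moves; token conservation makes the ack unnecessary.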
35. Outline
- Motivation and Background
- Token Coherence: Flat for Correctness
- Token Coherence: Hierarchical for Performance
- Evaluation
  - Model checking
  - Performance with commercial workloads
  - Robustness
36. TokenCMP Evaluation
- Simple?
  - Some anecdotal examples and comparisons
  - Model checking
- Fast?
  - Full-system simulation with commercial workloads
- Robust?
  - Micro-benchmarks to simulate high contention
37. Complexity Evaluation with Model Checking
- This work was performed by Jesse Bingham and Alan Hu of the University of British Columbia
- Methods
  - TLA+ and the TLC model checker
  - The DirectoryCMP model omits all intra-CMP details
  - TokenCMP's correctness substrate fully modeled
- Results
  - Model complexity is similar between TokenCMP and non-hierarchical DirectoryCMP
  - The correctness substrate is verified to be coherent and deadlock-free
  - All possible performance policies are correct
38. Performance Evaluation
- Target System
  - 4 CMPs, 4 processors per CMP
  - 2 GHz OoO SPARC cores, 8 MB shared L2 per chip
  - Directly connected interconnect
- Methods: Multifacet GEMS simulator
  - Simics augmented with timing models
  - Released soon: http://www.cs.wisc.edu/gems
- Benchmarks
  - Performance: Apache, SPEC, OLTP
  - Robustness: locking micro-benchmark
39. Full-System Simulation: Runtime
- TokenCMP performs 9-50% faster than DirectoryCMP
40. Full-System Simulation: Runtime
- TokenCMP performs 9-50% faster than DirectoryCMP
[Chart: runtimes shown against a DRAM-directory baseline and a perfect-L2 bound]
41. Full-System Simulation: Inter-CMP Traffic
- TokenCMP traffic is reasonable (or better)
- DirectoryCMP's control overhead exceeds broadcast cost for a small system
42. Full-System Simulation: Intra-CMP Traffic
43. Performance Robustness
[Chart: locking micro-benchmark, correctness substrate only; x-axis runs from less contention to more contention]
44. Performance Robustness
[Chart: next build of the same locking micro-benchmark, correctness substrate only]
45. Performance Robustness
[Chart: locking micro-benchmark; x-axis runs from less contention to more contention]
46. Summary
- Microprocessor → Chip Multiprocessor (CMP)
- Symmetric Multiprocessor (SMP) → Multiple CMPs
- Problem: Coherence with Multiple CMPs
- Old Solution: Hierarchical Directory (complex, slow)
- New Solution: Apply Token Coherence
  - Developed for glueless multiprocessors (ISCA 2003)
  - Keep flat for correctness
  - Exploit hierarchy for performance
  - Less complex and faster than a hierarchical directory