Token Coherence: A Framework for Implementing Multiple-CMP Systems

Transcript and Presenter's Notes
1
Token Coherence: A Framework for Implementing
Multiple-CMP Systems
  • Mike Marty¹, Jesse Bingham², Mark Hill¹, Alan
    Hu², Milo Martin³, and David Wood¹
  • ¹University of Wisconsin-Madison
  • ²University of British Columbia
  • ³University of Pennsylvania
  • February 17th, 2005

2
Summary
  • Microprocessor → Chip Multiprocessor (CMP)
  • Symmetric Multiprocessor (SMP) → Multiple CMPs
  • Problem: Coherence with Multiple CMPs
  • Old Solution: Hierarchical Directory (complex,
    slow)
  • New Solution: Apply Token Coherence
  • Developed for glueless multiprocessors [ISCA 2003]
  • Keep: Flat for correctness
  • Exploit: Hierarchical for performance
  • Less complex and faster than a hierarchical
    directory

3
Outline
  • Motivation and Background
  • Coherence in Multiple-CMP Systems
  • Example: DirectoryCMP
  • Token Coherence: Flat for Correctness
  • Token Coherence: Hierarchical for Performance
  • Evaluation

4
Coherence in Multiple-CMP Systems
  • Chip Multiprocessors (CMPs) emerging
  • Larger systems will be built with Multiple CMPs

[Slide diagram: four CMPs (CMP 1-CMP 4) joined by an interconnect]
5
Problem: Hierarchical Coherence
  • Intra-CMP protocol for coherence within CMP
  • Inter-CMP protocol for coherence between CMPs
  • Interactions between the two protocols increase
    complexity
  • The combined state space explodes

[Slide diagram: four CMPs; intra-CMP coherence within each chip, inter-CMP coherence across the interconnect]
6
Improving Multiple-CMP Systems with Token Coherence
  • Token Coherence allows Multiple-CMP systems to
    be...
  • Flat for correctness, but
  • Hierarchical for performance

[Slide diagram: a flat correctness substrate (low complexity) spans the four CMPs and their interconnect; a hierarchical performance protocol (fast) runs above it]
7
Example: DirectoryCMP
2-level MOESI Directory
[Slide diagram: two CMPs, each with four processors (P0-P7) and private L1 I/D caches over a shared L2/directory, with memory/directory below; concurrent Store B requests from both chips race, crossing getx, inv, ack, fwd, WB, and data/ack messages between the two protocol levels. Annotated: RACE CONDITIONS!]
8
Token Coherence Summary
  • Token Coherence separates performance from
    correctness
  • Correctness Substrate: enforces the coherence
    invariant and prevents starvation
  • Safety via token counting
  • Starvation avoidance via persistent requests
  • Performance Policy: makes the common case fast
  • Transient requests seek tokens
  • Unordered, untracked, unacknowledged
  • Enables prediction, multicast, filters, etc.

9
Outline
  • Motivation and Background
  • Token Coherence: Flat for Correctness
  • Safety
  • Starvation Avoidance
  • Token Coherence: Hierarchical for Performance
  • Evaluation

10
Example: Token Coherence [ISCA 2003]
[Slide diagram: four processors (P0-P3), each with private L1 I/D and L2 caches, connected by an interconnect to memories (mem 0-mem 3); the processors issue a mix of loads and stores to block B]
  • Each memory block is initialized with T tokens
  • Tokens are stored in memory, caches, and messages
  • At least one token is required to read a block
  • All T tokens are required to write a block (see the
    sketch below)
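
A minimal C++ sketch of this counting rule (the names and the
total T are illustrative assumptions, not the TokenCMP code):

    #include <cstdint>

    constexpr uint32_t T = 64;   // tokens per block, fixed at design time

    struct CacheLine {
        uint32_t tokens = 0;     // tokens this cache currently holds
    };

    bool canRead(const CacheLine& l)  { return l.tokens >= 1; }  // any token
    bool canWrite(const CacheLine& l) { return l.tokens == T; }  // all tokens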

11
Extending to Multiple-CMP System
CMP 0
CMP 1
P0
P1
P2
P3
L1 ID
L1 ID
L1 ID
L1 ID
L2
L2
L2
L2
interconnect
interconnect
Shared L2
Shared L2
interconnect
mem 0
mem 1
12
Extending to Multiple-CMP System
CMP 0
CMP 1
Store B
Store B
P0
P1
P2
P3
L1 ID
L1 ID
L1 ID
L1 ID
interconnect
interconnect
Shared L2
Shared L2
mem 0
mem 1
interconnect
  • Token counting remains flat
  • Tokens to caches
  • Handles shared caches and other complex
    hierarchies

13
Safety Recap
  • Safety: maintain the coherence invariant
  • Only one writer, or multiple readers
  • Tokens for safety:
  • T tokens associated with each memory block
  • Token count encoded in 1 + log2(T) bits (see below)
  • A processor acquires all tokens to write, a single
    token to read
  • Tokens passed to nodes in the glueless
    multiprocessor scheme
  • But CMPs have private and shared caches
  • Tokens passed to caches in a Multiple-CMP system
  • Arbitrary cache hierarchies are easily handled
  • Flat for correctness
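
A back-of-envelope check of that encoding, assuming one bit marks
the owner token and log2(T) bits count the rest (the shipped
encoding may differ):

    #include <cmath>
    #include <cstdio>

    int tokenBits(unsigned T) {
        return 1 + static_cast<int>(std::ceil(std::log2(T)));
    }

    int main() {
        std::printf("T=64 -> %d bits per block\n", tokenBits(64));  // 7 bits
    }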

14
Some Token Counting Implications
  • Memory must store tokens
  • Separate RAM
  • Use extra ECC bits
  • Token cache
  • T sized to caches to allow read-only copies in
    all caches
  • Replacements cannot be silent
  • Tokens must not be lost or dropped
  • Targeted for invalidate-based protocols
  • Not a solution for write-through or update
    protocols
  • Tokens must be identified by block address
  • Address must be in all token-carrying messages

15
Starvation Avoidance
  • Request messages can miss tokens
  • In-flight tokens
  • Transient requests are not tracked throughout the
    system
  • Incorrect filtering, multicast, destination-set
    prediction, etc.
  • Possible Solution: Retries
  • Retry with optional randomized backoff is effective
    for races
  • Guaranteed Solution: Persistent Requests (see the
    sketch below)
  • Heavyweight request guaranteed to succeed
  • Should be rare (uses more bandwidth)
  • Locates all tokens in the system
  • Orders competing requests
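
One way to picture the escalation, as a hedged C++ sketch (the
retry bound is hypothetical; TokenCMP itself escalates straight
to a persistent request with no retries):

    enum class NextAction { RetryTransient, IssuePersistent };

    struct MissStatus { int retries = 0; };

    NextAction onRequestTimeout(MissStatus& m, int maxRetries) {
        if (m.retries++ < maxRetries)
            return NextAction::RetryTransient;   // cheap: unordered, unacked
        return NextAction::IssuePersistent;      // heavyweight, guaranteed
    }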

16
Starvation Avoidance
[Slide diagram: in the two-CMP system, P0, P1, and P2 all attempt Store B while tokens are in flight between the chips]
  • Tokens move freely in the system
  • Transient requests can miss in-flight tokens
  • Incorrect speculation, filters, prediction, etc.

17
Starvation Avoidance
[Slide diagram: the same race on Store B; the starving requester escalates to a persistent request]
  • Solution: issue a Persistent Request
  • Heavyweight request guaranteed to succeed
  • Methods: Centralized [ISCA 2003] and Distributed
    (new)

18
Old Scheme: Central Arbiter [ISCA 2003]
[Slide diagram: all three Store B requests time out; each processor sends a persistent request for block B to arbiter 0, which queues them in order (P0, P2, P1)]
  • Processors issue persistent requests

19
Old Scheme: Central Arbiter [ISCA 2003]
[Slide diagram: arbiter 0 activates the head request and broadcasts the activation "B: P0" to every cache and memory, while P2 and P1 remain queued]
  • Processors issue persistent requests
  • Arbiter orders and broadcasts activate

20
Old Scheme: Central Arbiter [ISCA 2003]
[Slide diagram: (1) P0 completes and sends a deactivate to arbiter 0, (2) the arbiter broadcasts the deactivate together with the next activation "B: P2", (3) every node updates its entry for block B]
  • Processor sends a deactivate to the arbiter
  • Arbiter broadcasts the deactivate (and the next
    activate)
  • Bottom Line: handoff takes 3 message latencies

21
Improved Scheme: Distributed Arbitration (new)
[Slide diagram: P0, P1, and P2 broadcast persistent requests for block B; every processor, shared L2, and memory records all three in its local table]
  • Processors broadcast persistent requests

22
Improved Scheme: Distributed Arbitration (new)
[Slide diagram: the tables are identical everywhere, so each node independently marks P0's request, the lowest-numbered, as active]
  • Processors broadcast persistent requests
  • Fixed priority (processor number)

23
Improved Scheme: Distributed Arbitration (new)
[Slide diagram: (1) P0 completes its store and broadcasts a deactivate; every table drops P0's entry and P1's request becomes active]
  • Processors broadcast persistent requests
  • Fixed priority (processor number)
  • Processors broadcast deactivate

24
Improved Scheme: Distributed Arbitration (new)
[Slide diagram: P1's store completes the same way; each handoff needs only a single deactivate broadcast]
  • Bottom line: handoff is a single message latency
    (see the arbitration sketch below)
  • Subtle point: P0 and P1 must wait until the next
    wave
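
A C++ sketch of the fixed-priority rule, assuming every node sees
the same broadcast activates/deactivates and so holds an identical
table; each node then picks the winner locally, with no arbiter
round trips:

    #include <cstdint>
    #include <optional>
    #include <vector>

    struct PersistentReq { int proc; uint64_t addr; };

    // Identical at every node: the active requester for an address
    // is the lowest-numbered processor with a matching entry.
    std::optional<int> activeRequester(
            const std::vector<PersistentReq>& table, uint64_t addr) {
        std::optional<int> winner;
        for (const auto& r : table)
            if (r.addr == addr && (!winner || r.proc < *winner))
                winner = r.proc;
        return winner;
    }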

25
Implementing Distributed Persistent Requests
  • A table at each cache (sketched below)
  • Sized to N entries per processor (we use N = 1)
  • Indexed by processor ID
  • Content-addressable by address
  • Every incoming message must check the table
  • Not on the critical path, so a slow CAM suffices
  • Activate/deactivate reordering cannot be allowed
  • The persistent-request virtual channel must be
    point-to-point ordered
  • Or use another solution such as sequence numbers or
    acks
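
A hedged sketch of that table, with one entry per processor and a
hypothetical maximum of 16 processors:

    #include <array>
    #include <cstdint>

    constexpr int kMaxProcs = 16;   // maximum processor count is architected

    struct Entry { bool valid = false; uint64_t addr = 0; };

    struct PersistentTable {
        std::array<Entry, kMaxProcs> entries;   // indexed by processor ID

        void activate(int proc, uint64_t addr) { entries[proc] = {true, addr}; }
        void deactivate(int proc)              { entries[proc].valid = false; }

        // CAM-style search by address on every incoming message;
        // off the critical path, so a slow lookup is acceptable.
        bool anyActiveFor(uint64_t addr) const {
            for (const auto& e : entries)
                if (e.valid && e.addr == addr) return true;
            return false;
        }
    };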

26
Implementing Distributed Persistent Requests
  • Should reads be distinguished from writes?
  • Not necessary, but a persistent read request is
    helpful
  • Implications of flat distributed arbitration:
  • Simple: flat for correctness
  • Global broadcast when used
  • Fortunately, persistent requests are rare in typical
    workloads (0.3%)
  • A bad workload (very high contention) would burn
    bandwidth
  • The maximum processor count must be architected
  • What about a hierarchical persistent request
    scheme?
  • Possible, but correctness would no longer be flat
  • Make the common case fast

27
Reducing Unnecessary Traffic
  • Problem: which token-holding cache responds with
    data?
  • Solution: distinguish one token as the owner token
  • The owner includes data with its token response
    (see the sketch below)
  • The clean vs. dirty owner distinction is also useful
    for writebacks
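
A minimal C++ sketch of the owner-token rule (field names are
illustrative): every holder gives up tokens, but only the owner
attaches data, so each request draws exactly one data message.

    #include <cstdint>

    struct CacheLine {
        uint32_t tokens = 0;
        bool     owner  = false;   // holds the distinguished owner token?
    };

    struct TokenResponse { uint32_t tokens; bool hasData; };

    TokenResponse respond(CacheLine& l) {
        TokenResponse r{l.tokens, /*hasData=*/l.owner};
        l.tokens = 0;              // hand over everything held
        l.owner  = false;
        return r;
    }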

28
Outline
  • Motivation and Background
  • Token Coherence: Flat for Correctness
  • Token Coherence: Hierarchical for Performance
  • TokenCMP
  • Another look at performance policies
  • Evaluation

29
Hierarchical for Performance: TokenCMP
  • Target System:
  • 2-8 CMPs
  • Private L1s, shared L2 per CMP
  • Any interconnect, but high-bandwidth
  • Performance Policy Goals:
  • Aggressively acquire tokens
  • Exploit on-chip locality and bandwidth
  • Respect the cache hierarchy
  • Detect and handle missed tokens

30
Hierarchical for Performance: TokenCMP
  • Approach (sketched below):
  • On an L1 miss, broadcast within the local CMP
  • A local cache responds if possible
  • On an L2 miss, broadcast to the other CMPs
  • The appropriate L2 bank responds or re-broadcasts
    within its CMP
  • Optionally filter
  • Responses between CMPs carry extra tokens for
    future locality
  • Handling missed tokens:
  • Timeout after the average memory latency
  • Invoke a persistent request (no retries)
  • Larger systems can use filters, multicast, or
    soft-state directories
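
The escalation can be pictured as a small decision function (a
C++ sketch with hypothetical names, not the hardware controller):

    enum class MissLevel { L1Miss, L2Miss, TimedOut };
    enum class Action { BroadcastIntraCMP, BroadcastInterCMP,
                        PersistentRequest };

    Action nextAction(MissLevel level) {
        switch (level) {
            case MissLevel::L1Miss:   // try the caches on this chip first
                return Action::BroadcastIntraCMP;
            case MissLevel::L2Miss:   // then broadcast to the other CMPs
                return Action::BroadcastInterCMP;
            case MissLevel::TimedOut: // ~average memory latency elapsed
                return Action::PersistentRequest;
        }
        return Action::PersistentRequest;  // unreachable; quiets warnings
    }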

31
Other Optimizations in TokenCMP
  • Implementing the E-state
  • Memory responds with all tokens on a read request
  • The clean/dirty owner distinction avoids writing
    back unwritten data
  • Implementing migratory sharing (sketched below)
  • What is it? A processor's read request yields
    exclusive permission if the responder had exclusive
    permission and wrote the block
  • In TokenCMP, simply return all tokens
  • Non-speculative delay
  • Hold the block for some cycles so permission isn't
    stolen prematurely
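
A hedged C++ sketch of the migratory heuristic (the wroteBlock
flag is assumed bookkeeping, not a documented structure):

    #include <cstdint>

    constexpr uint32_t T = 64;   // total tokens per block (assumed)

    struct CacheLine {
        uint32_t tokens = 0;
        bool     wroteBlock = false;   // written while held exclusively?
    };

    // Tokens to hand over on a remote read: all of them when the
    // data looks migratory, so the reader can later write without
    // a second miss; otherwise just one.
    uint32_t tokensForRead(const CacheLine& l) {
        if (l.tokens == 0) return 0;   // nothing to give
        bool migratory = (l.tokens == T) && l.wroteBlock;
        return migratory ? T : 1;
    }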

32
Another Look at Performance Policies
  • How to find tokens?
  • Broadcast
  • Broadcast with filters
  • Multicast (destination-set prediction)
  • Directories (soft or hard)
  • Who responds with data?
  • The owner token holder
  • TokenCMP uses the owner token for inter-CMP
    responses
  • Other heuristics are possible
  • For TokenCMP intra-CMP responses, a cache responds
    if it has extra tokens

33
Transient Requests May Reduce Complexity
  • The processor holds the only required state about
    its request
  • The L2 controller in TokenCMP is very simple (see
    the sketch below)
  • Re-broadcasts an L1 request message on a miss
  • Re-broadcasts or filters external request messages
  • Possible states:
  • no tokens (I)
  • all tokens (M)
  • some tokens (S)
  • Bounces unexpected tokens to memory
  • DirectoryCMP's L2 controller is complex
  • Allocates an MSHR on a miss and forwards the
    request
  • Issues invalidates and receives acks
  • Orders all intra-CMP requests and writebacks
  • 57 states in our L2 implementation!
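
How few stable states that is can be shown directly: the L2's
state is a pure function of its token count (constant and names
assumed for illustration):

    #include <cstdint>

    constexpr uint32_t T = 64;

    enum class L2State { I, S, M };   // no, some, or all tokens

    L2State stateOf(uint32_t tokens) {
        if (tokens == 0) return L2State::I;   // invalid
        if (tokens == T) return L2State::M;   // writable
        return L2State::S;                    // read-only
    }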

34
Writebacks
  • DirectoryCMP uses 3-phase writebacks:
  • The L1 issues a writeback request
  • The L2 enters a transient state or blocks the
    request
  • The L2 responds with a writeback ack
  • The L1 sends the data
  • TokenCMP uses fire-and-forget writebacks (sketched
    below):
  • Immediately send tokens and data
  • Heuristic: only send data if tokens > 1
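
A sketch of the fire-and-forget message (the struct is
hypothetical): no ack and no transient state at the L1.

    #include <cstdint>

    struct WritebackMsg {
        uint64_t addr;     // tokens must always carry the block address
        uint32_t tokens;
        bool     hasData;
    };

    WritebackMsg makeWriteback(uint64_t addr, uint32_t tokens) {
        // Heuristic from above: attach data only when holding more
        // than one token (a lone token implies another valid copy
        // of the data exists elsewhere).
        return {addr, tokens, /*hasData=*/tokens > 1};
    }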

35
Outline
  • Motivation and Background
  • Token Coherence: Flat for Correctness
  • Token Coherence: Hierarchical for Performance
  • Evaluation
  • Model checking
  • Performance with commercial workloads
  • Robustness

36
TokenCMP Evaluation
  • Simple?
  • Some anecdotal examples and comparisons
  • Model checking
  • Fast?
  • Full-system simulation with commercial workloads
  • Robust?
  • Micro-benchmarks to simulate high contention

37
Complexity Evaluation with Model Checking
  • This work was performed by Jesse Bingham and Alan
    Hu of the University of British Columbia
  • Methods:
  • TLA+ and the TLC model checker
  • The DirectoryCMP model omits all intra-CMP details
  • TokenCMP's correctness substrate was modeled
  • Results:
  • Complexity is similar between TokenCMP and the
    non-hierarchical DirectoryCMP
  • The correctness substrate is verified to be correct
    and deadlock-free
  • All possible performance protocols are correct

38
Performance Evaluation
  • Target System:
  • 4 CMPs, 4 processors per CMP
  • 2 GHz OoO SPARC, 8 MB shared L2 per chip
  • Directly connected interconnect
  • Methods: Multifacet GEMS simulator
  • Simics augmented with timing models
  • Released soon: http://www.cs.wisc.edu/gems
  • Benchmarks:
  • Performance: Apache, SPEC, OLTP
  • Robustness: locking micro-benchmark

39
Full-system Simulation Runtime
  • TokenCMP performs 9-50% faster than DirectoryCMP

40
Full-system Simulation Runtime
  • TokenCMP performs 9-50% faster than DirectoryCMP

[Chart: normalized runtime with reference lines for a DRAM Directory and a Perfect L2]
41
Full-system Simulation Inter-CMP Traffic
  • TokenCMP traffic is reasonable (or better)
  • DirectoryCMP's control overhead exceeds the
    broadcast overhead for this small system

42
Full-system Simulation Intra-CMP Traffic
43
Performance Robustness
Locking micro-benchmark (correctness substrate only)
[Chart: runtime from less contention to more contention]
45
Performance Robustness
Locking micro-benchmark
[Chart: runtime from less contention to more contention]
46
Summary
  • Microprocessor → Chip Multiprocessor (CMP)
  • Symmetric Multiprocessor (SMP) → Multiple CMPs
  • Problem: Coherence with Multiple CMPs
  • Old Solution: Hierarchical Directory (complex,
    slow)
  • New Solution: Apply Token Coherence
  • Developed for glueless multiprocessors [ISCA 2003]
  • Keep: Flat for correctness
  • Exploit: Hierarchical for performance
  • Less complex and faster than a hierarchical
    directory