Memory Sharing Predictor: The key to speculative Coherent DSM

Transcript and Presenter's Notes
1
Memory Sharing Predictor: The key to speculative
Coherent DSM
  • An-Chow Lai
  • Babak Falsafi
  • Purdue University

2
Organization
  • Introduction
  • Directory based cache coherence
  • Pattern Based Message Predictors
  • Memory Sharing Predictors
  • Vector Memory Sharing predictors
  • Speculative Coherent operations
  • Performance Analysis
  • Results
  • Summary and conclusions

3
Introduction
  • Distributed Shared Memory Multiprocessors
  • Provide a logical shared address space over
    physically distributed memory
  • Programming is easier, comparable to SMPs.
  • Non-Uniform Memory Access (the bottleneck):
    remote access is far slower than local access.

[Figure: a DSM]
4
  • Efforts to eliminate this difference
  • Custom-designed motherboards cannot benefit
    from the excellent cost-performance of
    off-the-shelf motherboards
  • Reduce remote access frequency
  • Reduce coherence protocol overhead: will need
    complex adaptive coherence protocols.
  • Existing predictors are directed at specific
    sharing patterns known a priori.
  • Pattern-based predictors
  • Dynamically adapt to an application's sharing
    pattern at runtime
  • Do not modify the base coherence protocol
  • Memory Sharing Predictors & Vector Memory
    Sharing Predictors
  • Topic of this paper
  • Improvement on general pattern-based predictors
    proposed by Mukherjee & Hill

5
Directory based cache coherence
[Figure: a directory-based DSM with four nodes; each node has a processor
with caches, memory, I/O, and a directory, all joined by an
interconnection network.]
6
Directory based cache coherence
  • Directory-based cache coherence protocols
  • Each node maintains sharing information for its
    memory blocks
  • Based on a finite state machine in which
    states = directory states and actions = messages
  • This paper uses a half-migratory protocol
  • A speculative coherent DSM must accurately
    predict remote accesses and perform coherence
    actions in a timely manner.

A remote read request
Directory protocol transitions
7
Pattern Based Message Predictors
  • Predicts the sender and type of the next
    incoming message for a particular block.
  • Structure similar to a two-level branch
    predictor
  • History table captures the most recent sequence
    of incoming messages for every memory block
  • Pattern table records all observed sequences of
    coherence messages for every memory block (an
    entry = a sequence of messages + the predicted
    next message)

A two-level message predictor
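The two-level structure can be sketched in software; the class, field, and message names below are illustrative assumptions, not the paper's hardware design:

```python
from collections import defaultdict

class MessagePredictor:
    """Sketch of a per-block two-level message predictor.

    History table: the last `depth` incoming messages per memory block.
    Pattern table: maps an observed history to the message that followed it.
    """
    def __init__(self, depth=1):
        self.depth = depth
        self.history = defaultdict(tuple)   # block -> last `depth` messages
        self.patterns = defaultdict(dict)   # block -> {history: next message}

    def predict(self, block):
        # Look up the current history in the pattern table (None if unseen).
        return self.patterns[block].get(self.history[block])

    def observe(self, block, msg):
        # Train: once the history register is full, record history -> msg,
        # then shift the new message into the history register.
        h = self.history[block]
        if len(h) == self.depth:
            self.patterns[block][h] = msg
        self.history[block] = (h + (msg,))[-self.depth:]
```

With a repeating sharing pattern such as alternating reads and writes, the predictor learns the sequence after one pass and predicts the next sender and type from the current history.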
8
Pattern Based Message Predictors (contd.)
  • Depth of the history table register = the
    number of past messages it keeps track of.
  • Deeper history depth => more accurate
    prediction, no race conditions.
  • Deeper history depth => large pattern history
    table => high cost.

9
Memory Sharing Predictors
  • Shortcomings of the general message predictor
  • Invalidation messages may arrive in any order,
    and thus may interfere with prediction of the
    more necessary request messages
  • It increases the number of pattern table
    entries (almost doubles them)
  • It increases the number of bits needed to encode
    the messages (three requests + two acks).
  • Observations
  • To eliminate the coherence overhead on remote
    access, it is only necessary to predict memory
    request messages (read, write, upgrade).
  • Predicting coherence acknowledgement messages is
    extra overhead, as they are always expected to
    arrive in response to a coherence action

10
Memory Sharing Predictors
  • MSP addresses these issues by
  • predicting only the memory request messages
  • Since the acknowledgements are eliminated, all
    the effects of possible reordering of
    acknowledgements are eliminated.
  • Only 2 bits are required to encode messages,
    compared to 3 for the general predictor
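A minimal sketch of the filtering and the 2-bit encoding; the acknowledgement-type names are assumptions for illustration:

```python
# The paper's three request types fit in 2 bits; a general predictor also
# encodes acknowledgements (hypothetical names below) and so needs 3 bits.
REQUESTS = ("read", "write", "upgrade")
ACKS = ("inv_ack", "writeback_ack")

def msp_filter(stream):
    """Keep only memory request messages from a (sender, type) stream;
    acks are dropped because they always follow a coherence action."""
    return [m for m in stream if m[1] in REQUESTS]

def encode(mtype):
    """2-bit encoding of a request type."""
    return REQUESTS.index(mtype)
```

Dropping acks both shrinks the encoding and removes the ack-reordering noise from the history.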

11
VMSP: A Vector MSP
  • Observations
  • A full-map protocol allows multiple processors
    to simultaneously cache a read-only copy of a
    memory block.
  • A predictor must identify the sharers, but need
    not maintain the order in which they read.
  • Optimization to MSP to get VMSP
  • Rather than record and predict read requests as
    individual pattern table entries, encode a
    sequence of read requests as a bit vector, just
    like the directory maintains the list of sharers.
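The bit-vector encoding is order-independent, which a minimal sketch (an illustration, not the hardware) makes concrete:

```python
def encode_readers(read_sequence, n_procs):
    """Collapse a run of read requests into a sharer bit vector, mirroring
    a full-map directory's sharer list. Bit i is set iff processor i read
    the block; the order of the reads is deliberately lost."""
    vec = 0
    for pid in read_sequence:
        assert 0 <= pid < n_procs
        vec |= 1 << pid
    return vec
```

Because any permutation of the same readers yields the same vector, read reordering no longer multiplies pattern-table entries.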

12
Vector Memory Sharing Predictor (contd.)
  • Benefits
  • reduces the number of pattern table entries
  • eliminates the effect of re-ordering of reads on
    size
  • Effective history depth = number of sharers
  • Good when the number of readers is large
    (> 2n / (2 log n)).

13
Triggering Request Speculation
  • Important considerations
  • Predict what remote memory requests arrive
  • Predict when remote accesses arrive
  • Execute necessary coherence actions

A speculative coherent DSM node and coherence
hardware
14
Triggering Request Speculation
  • A) What remote memory request arrives somewhat
    simple from pattern history table (which stores
    what memory accesses take place)
  • B) When somewhat tough here
  • early speculation may take away block from its
    readers
  • Late speculation may incur additional delay and
    may limit DSMs ability to hide coherence
    overhead
  • was not a problem in COSMOS as all the coherence
    messages were being predicted but not sent. They
    were sent only after the previous message
    arrived. Since there are no coherence
    acknowledgement messages in the history table so
    timing is a problem now.

15
Triggering Request Speculation
  • Two ways to overcome
  • 1) Speculative Write-Invalidation
  • Based on the most common memory access pattern,
    the producer/consumer scenario: the producer
    writes to a memory block and then no longer
    accesses it until it has been read by the
    consumers. Common in parallel commercial
    database servers.
  • MSP predicts that a processor is done writing
    when the processor writes to some other memory
    location
  • Maintain an early write-invalidate (EWI) table
    that stores the last address written by a
    processor.
  • If the address in the EWI table changes, trigger
    the speculative write-invalidate and the
    subsequent reads.
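The EWI-table trigger can be sketched as follows (class and method names are mine, not the paper's):

```python
class EarlyWriteInvalidate:
    """Sketch of an early write-invalidate (EWI) table: per processor,
    remember the last block it wrote; a write to a *different* block is
    taken as the signal that the previous block is done being produced."""
    def __init__(self):
        self.last_write = {}    # processor id -> last block address written

    def on_write(self, proc, block):
        """Record a write; return the block to speculatively invalidate,
        or None if no speculation is triggered."""
        prev = self.last_write.get(proc)
        self.last_write[proc] = block
        if prev is not None and prev != block:
            return prev
        return None
```

Repeated writes to the same block do not trigger anything; only moving on to a new address does, matching the producer/consumer heuristic above.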

16
Comparison with the general message predictor
[Figure: two timelines, each with P1 (reader), P3 (directory), and P2
(writer). Events shown: Write A, Read, Invalidate, Write B, Writeback,
Send block, Prefetching starts, Read hit. With MSP, P2's write to B
triggers the speculative invalidation and writeback, so the block is
prefetched to P1 and its later read hits.]
17
Question?
  • What happens if, while the speculatively read
    data is being sent by P3 to P1, P1 has already
    made the request for the data?

18
Question?
  • What happens if, while the speculatively read
    data is being sent by P3 to P1, P1 has already
    made the request for the data?
  • The DSM node, on receiving the speculated
    message, drops this message to avoid modifying
    the protocol.

19
Question?
  • What happens if P1 makes the read request before
    P2 does the second write?

20
Question?
  • What happens if P1 makes the read request before
    P2 does the second write?
  • 2) First Read
  • If SWI fails, then on the first read request
    made, all subsequent reads are triggered.

21
Speculative Coherence Operations
  • Final action
  • execute a coherence action speculatively
  • verify the accuracy of the predictor
  • Requirements
  • Co-exist with the base coherence protocol without
    any protocol modifications
  • MSP simply advises the protocol to execute
    coherence operations. Any misspeculation results
    in additional coherence operations but no
    interference with protocol functionality
  • E.g., a premature write-invalidation results in
    additional read/write requests by the producer.
  • MSP will advise the protocol to send read-only
    block copies to requesters.

22
Verification of accuracy
  • A reference bit in the remote cache for every
    block placed speculatively
  • On an actual reference, the remote cache clears
    the bit, verifying that the access occurred.
  • On invalidation of this block, the reference bit
    is sent along with the invalidation message
  • The MSP at the home node examines this bit and
    removes mispredicted messages.
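A toy model of the reference-bit handshake (class and method names are my own):

```python
class SpeculativeLine:
    """One speculatively placed cache line with its reference bit."""
    def __init__(self):
        self.ref = 1            # set when the block is placed speculatively

    def access(self):
        self.ref = 0            # an actual reference confirms the prediction

    def invalidate(self):
        return self.ref         # the bit rides back with the invalidation

# Home-node side: a returned bit of 1 means the block was never referenced,
# so the MSP removes the corresponding mispredicted entry.
```

The key property is that verification costs one bit per line and one piggybacked bit per invalidation, with no protocol changes.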

23
Performance Analysis
  • Performance depends on
  • Speculation accuracy
  • Reduction in latency on successful speculation
  • Misspeculation penalty
  • Speculation opportunity: a computationally
    intensive application will benefit little from
    speculation.
  • Assumptions
  • When a speculative memory request is successfully
    executed, the entire remote latency is hidden
  • Misspeculation only slows the remote access; it
    does not increase the request frequency

24
Performance
  • Performance Model
  • c = application's communication ratio
  • f = fraction of speculatively executed requests
    over all received requests
  • p = request prediction accuracy
  • l_access = local access latency
  • r_access = remote access latency
  • rtl = r_access / l_access
  • n = misspeculation penalty factor
  • N = number of remote requests on the critical
    path

25
Performance
  • Communication speedup is given by
  • (comm time w/o speculation) / (comm time w/
    speculation)
  • = N * r_access /
    ((1-f) * N * r_access
     + f * N * (p * l_access + (1-p) * n * r_access))
  • = 1 / ((1-f) + f * (p/rtl + n * (1-p)))
  • Total speedup is given by
  • (total execution time w/o speculation) / (total
    execution time w/ speculation)
  • = 1 / ((1-c) + c / comm_speedup)
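The model is easy to evaluate numerically; a sketch, with parameter names following the slide:

```python
def comm_speedup(f, p, rtl, n):
    """Communication speedup: 1 / ((1-f) + f*(p/rtl + n*(1-p)))."""
    return 1.0 / ((1 - f) + f * (p / rtl + n * (1 - p)))

def total_speedup(c, f, p, rtl, n):
    """Total speedup (Amdahl-style): 1 / ((1-c) + c/comm_speedup)."""
    return 1.0 / ((1 - c) + c / comm_speedup(f, p, rtl, n))

# With every request speculated and perfectly predicted (f = p = 1), the
# entire remote latency is hidden and communication speeds up by rtl;
# with p = 0 and a penalty factor n > 1, communication actually slows down.
```

This makes the slowdown regime visible: at low p the n*(1-p) term dominates and the "speedup" drops below 1.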

26
Speedup vs. various parameters
Potential Speedup in a speculative coherent DSM
27
Speedups
  • Prediction accuracy plays a prominent role in
    speedup
  • A low prediction accuracy of 10% to 50% results
    in slowdown due to high speculation overhead,
    while a high prediction accuracy (90%) increases
    speedup even for moderate communication ratios.
  • At high prediction rates, slowdown due to
    increasing misspeculation penalty is not
    significant
  • f, the fraction of speculated requests, is a
    measure of the number of request messages it
    takes to learn and predict. For rapidly changing
    patterns, even at high prediction accuracy, the
    performance improvement will not be significant.
  • A speculative coherent protocol impacts clusters
    most because of their high rtl ratio.

28
Simulation results
  • Wisconsin Wind Tunnel II to simulate a CC-NUMA
    with 16 nodes interconnected through hardware DSM
    boards to a low-latency switched network.
  • Full-map write-invalidate protocol with 32-byte
    coherence blocks.
  • Benchmarks: appbt, barnes, em3d, moldyn, ocean,
    tomcatv, unstructured.

29
Results
Base predictor accuracy comparison (history depth = 1)
30
Results
  • Em3d and Moldyn exhibit producer/consumer
    sharing with small read sharing => low impact of
    read ordering => high performance with MSP.
  • Unstructured exhibits wide read-sharing in its
    producer/consumer phase, hence MSP can get a
    prediction accuracy of less than 65% while VMSP
    can get almost 85%.

31
Results
Prediction accuracy with varying history depths
32
Results
Messages predicted (correctly predicted) for a
history depth of 1
33
Results
Predictor storage overhead
34
Results
  • All predictors use 4 bits to encode a processor
    id
  • Cosmos uses 3 bits to encode a message type =>
    7 bits per history table entry and 14 bits per
    PTE => (7+14) bits per block
  • MSP and VMSP use 2 bits to encode a message type
  • MSP: 12 bits per PTE => (6+12) bits per block
  • VMSP uses 18 bits per history table entry, but
    (18+6) bits per PTE => (18+24) bits per block (in
    VMSP a read vector is always followed by a
    write/upgrade and vice versa; a PTE will contain
    at most one entry).
  • MSP and VMSP require less storage compared to
    Cosmos.
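The per-block arithmetic above can be reproduced directly; this is a sketch of the bookkeeping from the slide's numbers (4-bit ids for 16 nodes), not a hardware specification:

```python
PID_BITS = 4                              # processor id (16 nodes)

cosmos_hist = PID_BITS + 3                # 3-bit message type -> 7 bits/entry
cosmos_pte = 2 * cosmos_hist              # history + predicted msg -> 14 bits
cosmos_block = cosmos_hist + cosmos_pte   # (7+14) bits per block

msp_hist = PID_BITS + 2                   # 2-bit message type -> 6 bits/entry
msp_pte = 2 * msp_hist                    # 12 bits
msp_block = msp_hist + msp_pte            # (6+12) bits per block

vmsp_hist = 16 + 2                        # 16-bit sharer vector + 2-bit type
vmsp_pte = vmsp_hist + msp_hist           # vector + predicted msg: (18+6) bits
vmsp_block = vmsp_hist + vmsp_pte         # (18+24) bits per block
```

The per-block figures come out to 21, 18, and 42 bits; VMSP's larger entries are offset by needing far fewer pattern-table entries.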

35
Summary and Conclusion
  • Proposed the Memory Sharing Predictor to predict
    and execute coherence operations speculatively.
  • MSP eliminates acknowledgement messages in
    pattern tables and increases prediction accuracy
    from 81% to 86%.
  • VMSP further improves accuracy up to 93% using
    compact vector representations and eliminating
    perturbations due to read-request reorderings.
  • VMSP also reduces implementation storage.
  • High-accuracy predictors are key to a
    high-performance SC DSM.

36
  • Discussions