Title: Memory Sharing Predictor: The key to speculative Coherent DSM
1. Memory Sharing Predictor: The Key to Speculative Coherent DSM
- An-Chow Lai
- Babak Falsafi
- Purdue University
2. Organization
- Introduction
- Directory based cache coherence
- Pattern Based Message Predictors
- Memory Sharing Predictors
- Vector Memory Sharing predictors
- Speculative Coherent operations
- Performance Analysis
- Results
- Summary and conclusions
3. Introduction
- Distributed Shared Memory (DSM) Multiprocessors
- Provide a logical shared address space over physically distributed memory
- Programming is easier compared to SMPs.
- Non-Uniform Memory Access (the bottleneck): remote access is far slower than local access.
(Figure: a DSM system)
4. Introduction (contd.)
- Efforts to eliminate this difference:
- Custom-designed motherboards cannot get the benefit of the excellent cost-performance of off-the-shelf motherboards.
- Reduce remote access frequency.
- Reduce coherence protocol overhead: this will need complex adaptive coherence protocols.
- Existing predictors are directed at specific sharing patterns known a priori.
- Pattern based predictors:
- Dynamically adapt to an application's sharing pattern at runtime
- Do not modify the base coherence protocol
- Memory Sharing Predictors and Vector Memory Sharing Predictors:
- Topic of this paper
- Improvement on the general pattern based predictors proposed by Mukherjee & Hill
5. Directory based cache coherence
(Figure: four nodes, each with a processor with caches, memory, I/O, and a directory, connected by an interconnection network)
6. Directory based cache coherence
- Directory based cache coherence protocols:
- Each node maintains sharing information for all of its memory blocks
- Based on a finite state machine: states = directory states, actions = messages
- This paper uses a half migratory protocol
- A speculative coherent DSM must accurately predict remote accesses and perform actions in a timely manner.
(Figures: a remote read request; directory protocol transitions)
7. Pattern Based Message Predictors
- Predicts the sender and type of the next incoming message for a particular block.
- Structure similar to a two-level branch predictor:
- History table captures the most recent sequence of incoming messages for every memory block
- Pattern table records all observed sequences of coherence messages for every memory block (an entry: sequence of messages → predicted message)
(Figure: a two-level message predictor)
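The two-level structure can be sketched in software roughly as follows; the class and method names are illustrative, not part of the paper's hardware design:

```python
# Minimal sketch of a two-level (Cosmos-style) message predictor:
# a per-block history register indexes a per-block pattern table.
class TwoLevelPredictor:
    def __init__(self, depth=1):
        self.depth = depth
        self.history = {}   # block -> tuple of the last `depth` messages
        self.patterns = {}  # block -> {history tuple: predicted next message}

    def predict(self, block):
        """Look up the pattern table with the block's current history."""
        hist = self.history.get(block, ())
        return self.patterns.get(block, {}).get(hist)

    def observe(self, block, msg):
        """Train: record that `msg` followed the current history, then shift."""
        hist = self.history.get(block, ())
        if len(hist) == self.depth:
            self.patterns.setdefault(block, {})[hist] = msg
        self.history[block] = (hist + (msg,))[-self.depth:]

p = TwoLevelPredictor(depth=1)
for m in [("P1", "read"), ("P2", "write"), ("P1", "read"), ("P2", "write")]:
    p.observe("blockA", m)
print(p.predict("blockA"))  # → ('P1', 'read')
```

With depth 1, the alternating read/write pattern is learned after one repetition: seeing P2's write, the predictor expects P1's read next.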
8. Pattern Based Message Predictors (contd.)
- Depth of the history register = number of past messages it keeps track of.
- Deeper history depth → more accurate prediction, no race conditions.
- Deeper history depth → large pattern table → high cost.
9. Memory Sharing Predictors
- Shortcomings of the general message predictor:
- Invalidation acknowledgement messages may arrive in any order, and thus may interfere with prediction of the more necessary request messages.
- It increases the number of pattern table entries (almost doubles them).
- It increases the number of bits needed to encode the messages (three requests + two acks).
- Observations:
- To eliminate the coherence overhead on remote access, it is only necessary to predict memory request messages (read, write, upgrade).
- Coherence acknowledgement message prediction is extra overhead, as acks are always expected to arrive in response to a coherence action.
10. Memory Sharing Predictors
- MSP addresses these issues by predicting only the memory request messages.
- Since the acknowledgements are eliminated, all effects of possible reordering of acknowledgements are eliminated.
- Only 2 bits are required to encode messages, compared to 3 for the general predictor.
11. VMSP: A Vector MSP
- Observations:
- A full-map protocol allows multiple processors to simultaneously cache read-only copies of a memory block.
- A predictor must identify the sharers, not maintain the order in which they read.
- Optimization of MSP into VMSP:
- Rather than record and predict read requests as individual pattern table entries, encode a sequence of read requests as a bit vector, just as the directory maintains its list of sharers.
12. Vector Memory Sharing Predictor (contd.)
- Benefits:
- Reduces the number of pattern table entries
- Eliminates the effect of reordering of reads on table size
- Effect of history depth and number of sharers:
- Good when the number of readers is large (> (2n)/(2 log n), i.e., roughly n/log n for n nodes).
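The vector encoding can be illustrated with a small sketch; the node count and ids below are hypothetical:

```python
# Sketch: fold a run of read requests into a sharer bit vector (as the
# directory does), so reader arrival order no longer matters.
N_NODES = 4  # illustrative system size

def reads_to_vector(readers):
    """Encode any ordering of read requests as one bit vector."""
    vec = 0
    for node in readers:
        vec |= 1 << node  # set the bit for this sharer
    return vec

# Two different arrival orders collapse to the same pattern-table entry:
assert reads_to_vector([0, 2, 3]) == reads_to_vector([3, 0, 2]) == 0b1101
```

This is why VMSP needs only one pattern-table entry per read-sharing episode, where MSP would need one entry per reader and per observed ordering.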
13. Triggering Request Speculation
- Important considerations:
- Predict what remote memory requests arrive
- Predict when remote accesses arrive
- Execute the necessary coherence actions
(Figure: a speculative coherent DSM node and coherence hardware)
14. Triggering Request Speculation
- A) What remote memory request arrives: relatively simple, read from the pattern table (which stores what memory accesses take place).
- B) When it arrives: harder.
- Early speculation may take a block away from its readers.
- Late speculation may incur additional delay and may limit the DSM's ability to hide coherence overhead.
- This was not a problem in Cosmos, since all coherence messages were predicted but not sent; they were sent only after the previous message arrived. Since there are no coherence acknowledgement messages in MSP's history table, timing is now a problem.
15. Triggering Request Speculation
- Two ways to overcome this:
- 1) Speculative Write Invalidation (SWI)
- Based on the most common memory access pattern, the producer/consumer scenario: a producer writes to a memory block and then no longer accesses it until it has been read by consumers. Common in parallel commercial database servers.
- MSP predicts that a processor is done writing when the processor writes to some other memory location.
- Maintain an early write-invalidate (EWI) table that stores the last address written by each processor.
- If an address in the EWI table changes, trigger a speculative write invalidate and the subsequent reads.
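The EWI heuristic can be sketched as follows; the table layout and the callback are illustrative assumptions, not the paper's hardware interface:

```python
# Sketch of the early write-invalidate (EWI) heuristic: when a
# processor's last-written address changes, assume it is done with the
# old block and trigger a speculative write invalidation.
ewi_table = {}  # processor id -> last block address written

def on_write(proc, block, trigger_swi):
    """Record the write; if the processor moved to a new block, speculate."""
    last = ewi_table.get(proc)
    ewi_table[proc] = block
    if last is not None and last != block:
        trigger_swi(last)  # speculatively invalidate the previous block

invalidated = []
on_write("P2", 0xA0, invalidated.append)
on_write("P2", 0xB0, invalidated.append)  # write to a different address
print(invalidated)  # prints [160], i.e. block 0xA0 was invalidated early
```

Note the heuristic fires only on a change of written address, matching the slide: repeated writes to the same block do not trigger speculation.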
16. Comparison with the general message predictor
(Figure: timing diagrams for P1 = reader, P2 = writer, P3 = directory, showing write A, read, invalidate, write B, writeback, and send block; with speculation, prefetching starts at the early write invalidate, so P1's read becomes a read hit)
17. Question?
- What happens if, while the speculated read data is being sent by P3 to P1, P1 has already made the request for the data?
18. Question?
- What happens if, while the speculated read data is being sent by P3 to P1, P1 has already made the request for the data?
- The DSM node, on receiving the speculated message, drops it, to avoid modifying the protocol.
19. Question?
- What happens if P1 makes the read request before P2 does the second write?
20. Question?
- What happens if P1 makes the read request before P2 does the second write?
- 2) First Read
- If SWI fails, then on the first read request made, all subsequent reads are triggered.
21. Speculative Coherence Operations
- Final action:
- Execute a coherence action speculatively
- Verify the accuracy of the predictor
- Requirements:
- Co-exist with the base coherence protocol without any protocol modifications.
- MSP simply advises the protocol to execute coherence operations. Any misspeculation results in additional coherence operations but no interference with protocol functionality.
- E.g., a premature write invalidation results in an additional read/write request by the producer.
- MSP advises the protocol to send read-only block copies to the predicted requesters.
22. Verification of accuracy
- A reference bit is kept in the remote cache for every block placed speculatively.
- On an actual reference, the remote cache clears the bit, verifying that the access occurred.
- On invalidation of the block, the reference bit is sent along with the invalidation message.
- The MSP at the home node examines this bit and removes mispredicted messages.
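A rough software model of the reference-bit scheme, with illustrative names (the real mechanism is a single bit per cache line):

```python
# Sketch: the remote cache sets a reference bit when a block arrives
# speculatively and clears it on a real access; the bit rides back on
# the invalidation so the home MSP can spot useless speculations.
class RemoteCacheLine:
    def __init__(self, speculative):
        self.ref_bit = speculative  # set iff the block arrived speculatively

    def access(self):
        self.ref_bit = False  # a real reference confirms the speculation

    def invalidate(self):
        return self.ref_bit  # True -> speculative send was never used

used = RemoteCacheLine(speculative=True)
used.access()
print(used.invalidate())  # → False: the speculation was verified as useful
```

A line that is invalidated with the bit still set tells the home node the corresponding prediction was wrong, so it can be removed from the pattern table.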
23. Performance Analysis
- Performance depends on:
- Speculation accuracy
- Reduction in latency on successful speculation
- Misspeculation penalty
- Speculation opportunity: a computationally intensive application will benefit little from speculation.
- Assumptions:
- When a speculative memory request is successfully executed, the entire remote latency is hidden.
- Misspeculation only slows the remote access; it does not increase the request frequency.
24. Performance
- Performance model:
- c = application's communication ratio
- f = fraction of speculatively executed requests over all received requests
- p = request prediction accuracy
- l_access = local access latency
- r_access = remote access latency
- rtl = r_access / l_access
- n = misspeculation penalty factor
- N = number of remote requests on the critical path
25. Performance
- Communication speedup = (comm. time w/o speculation) / (comm. time w/ speculation)
  = N·r_access / [ (1-f)·N·r_access + f·N·(p·l_access + (1-p)·n·r_access) ]
  = 1 / [ (1-f) + f·(p/rtl + n·(1-p)) ]
- Total speedup = (total execution time w/o speculation) / (total execution time w/ speculation)
  = 1 / [ (1-c) + c/(comm_speedup) ]
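The model is easy to evaluate numerically; the parameter values below are illustrative, not measurements from the paper:

```python
# Worked example of the slide's analytic speedup model.
def comm_speedup(f, p, rtl, n):
    """Communication speedup: 1 / ((1-f) + f*(p/rtl + n*(1-p)))."""
    return 1.0 / ((1 - f) + f * (p / rtl + n * (1 - p)))

def total_speedup(c, f, p, rtl, n):
    """Overall speedup (Amdahl-style): 1 / ((1-c) + c/comm_speedup)."""
    return 1.0 / ((1 - c) + c / comm_speedup(f, p, rtl, n))

# e.g. 40% communication ratio, 80% of requests speculated, 90% accuracy,
# remote/local latency ratio 10, misspeculation penalty factor 1.5:
s = total_speedup(c=0.4, f=0.8, p=0.9, rtl=10.0, n=1.5)
print(round(s, 2))  # → 1.32
```

Lowering p to, say, 0.3 in this formula makes the bracketed term exceed 1 and the model predicts a slowdown, which matches the discussion of low prediction accuracies on the next slides.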
26. Speedup vs. various parameters
(Figure: potential speedup in a speculative coherent DSM)
27. Speedups
- Prediction accuracy plays a prominent role in speedup.
- A low prediction accuracy (10-50%) results in slowdown due to high speculation overhead, while a high prediction accuracy (90%) increases speedup even for moderate communication ratios.
- At high prediction rates, the slowdown due to increasing misspeculation penalty is not significant.
- f, the fraction of speculated requests, is a measure of the number of request messages it takes to learn and predict. For rapidly changing patterns, even at high prediction accuracy, the performance improvement will not be significant.
- A speculative coherent protocol impacts clusters most, because of their high rtl ratio.
28. Simulation results
- Wisconsin Wind Tunnel II, simulating a CC-NUMA with 16 nodes interconnected through hardware DSM boards over a low-latency switched network.
- Full-map write-invalidate protocol with 32-byte coherence blocks.
- Benchmarks: appbt, barnes, em3d, moldyn, ocean, tomcatv, unstructured.
29. Results
(Figure: base predictor accuracy comparison, history depth = 1)
30. Results
- Em3d and moldyn exhibit producer/consumer sharing with small read sharing → low impact of read ordering → high performance with MSP.
- Unstructured exhibits wide read-sharing in its producer/consumer phase, hence MSP gets a prediction accuracy of less than 65% while VMSP gets almost 85%.
31. Results
(Figure: prediction accuracy with varying history depths)
32. Results
(Figure: messages predicted (correctly predicted) for a history depth of 1)
33. Results
(Figure: predictor storage overhead)
34. Results
- All predictors use 4 bits to encode a processor id.
- Cosmos uses 3 bits to encode the message type → 7 bits per history table entry and 14 bits per pattern table entry (pte) → (7+14) bits per block.
- MSP and VMSP use 2 bits to encode a message type.
- MSP: 12 bits per pte → (6+12) bits per block.
- VMSP uses 18 bits per history table entry, but (18+6) bits per pte → (18+24) bits per block (in VMSP a read vector is always followed by a write/upgrade and vice versa, so a pte contains at most one entry).
- MSP and VMSP require less storage compared to Cosmos.
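The per-block arithmetic can be checked directly; the breakdown below is one reading of the slide's numbers (16 nodes, hence the 16-bit sharer vector in VMSP):

```python
# Checking the slide's per-block predictor storage arithmetic.
PID_BITS = 4  # 16 nodes -> 4-bit processor id

cosmos_hist = PID_BITS + 3          # 7-bit history entry (id + 3-bit msg type)
cosmos_pte  = 2 * cosmos_hist       # 14-bit pte: observed pattern + prediction
msp_hist    = PID_BITS + 2          # 6-bit history entry (id + 2-bit msg type)
msp_pte     = 2 * msp_hist          # 12-bit pte
vmsp_hist   = 16 + 2                # 18 bits: sharer vector + msg type
vmsp_pte    = vmsp_hist + msp_hist  # 18 + 6 = 24-bit pte

print(cosmos_hist + cosmos_pte)  # 21 bits per block for Cosmos
print(msp_hist + msp_pte)        # 18 bits per block for MSP
print(vmsp_hist + vmsp_pte)      # 42 bits per block for VMSP
```

Even though VMSP's entries are wider, its pattern table holds at most one entry per block, so total storage still comes out below Cosmos in the slide's accounting.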
35. Summary and Conclusion
- Proposed the Memory Sharing Predictor to predict and execute coherence operations speculatively.
- MSP eliminates acknowledgement messages in pattern tables, increasing prediction accuracy from 81% to 86%.
- VMSP further improves accuracy, up to 93%, using compact vector representations and eliminating perturbations due to read request reorderings.
- VMSP also reduces implementation storage.
- High accuracy predictors are key to a high performance speculative coherent DSM.