Title: Memory Sharing Predictor: The key to speculative Coherent DSM
1. Memory Sharing Predictor: The Key to Speculative Coherent DSM
- An-Chow Lai
- Babak Falsafi
- Purdue University
2. Organization
- Introduction
- Directory based cache coherence
- Pattern Based Message Predictors
- Memory Sharing Predictors
- Vector Memory Sharing predictors
- Speculative Coherent operations
- Performance Analysis
- Results
- Summary and conclusions
3. Introduction
- Distributed Shared Memory (DSM) Multiprocessors
- Provide a logical shared address space over physically distributed memory
- Programming is easier compared to SMPs.
- Non-Uniform Memory Access (the bottleneck): remote access is far slower than local access.
(Figure: a DSM system)
4. Introduction (contd.)
- Efforts to eliminate this difference:
- Custom-designed motherboards cannot get the benefit of the excellent cost-performance of off-the-shelf motherboards.
- Reduce remote access frequency.
- Reduce coherence protocol overhead: this will need complex adaptive coherence protocols.
- Existing predictors are directed at specific sharing patterns known a priori.
- Pattern based predictors:
- Dynamically adapt to an application's sharing pattern at runtime
- Do not modify the base coherence protocol
- Memory Sharing Predictors and Vector Memory Sharing Predictors:
- Topic of this paper
- Improvement on the general pattern based predictors proposed by Mukherjee & Hill
5. Directory based cache coherence
(Figure: four nodes, each with a processor with caches, memory, I/O, and a directory, connected by an interconnection network)
6. Directory based cache coherence
- Directory based cache coherence protocols:
- Each node maintains sharing information for all of its memory blocks
- Based on a finite state machine: states = directory states, actions = messages
- This paper uses a half migratory protocol
- A speculative coherent DSM must accurately predict remote accesses and perform actions in a timely manner.
(Figures: a remote read request; directory protocol transitions)
7. Pattern Based Message Predictors
- Predicts the sender and type of the next incoming message for a particular block.
- Structure similar to a two-level branch predictor:
- History table captures the most recent sequence of incoming messages for every memory block
- Pattern table records all observed sequences of coherence messages for every memory block (an entry: sequence of messages → predicted message)
(Figure: a two-level message predictor)
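The two-level structure can be sketched in software roughly as follows; the class and method names are illustrative, not part of the paper's hardware design:

```python
# Minimal sketch of a two-level (Cosmos-style) message predictor:
# a per-block history register indexes a per-block pattern table.
class TwoLevelPredictor:
    def __init__(self, depth=1):
        self.depth = depth
        self.history = {}   # block -> tuple of the last `depth` messages
        self.patterns = {}  # block -> {history tuple: predicted next message}

    def predict(self, block):
        """Look up the pattern table with the block's current history."""
        hist = self.history.get(block, ())
        return self.patterns.get(block, {}).get(hist)

    def observe(self, block, msg):
        """Train: record that `msg` followed the current history, then shift."""
        hist = self.history.get(block, ())
        if len(hist) == self.depth:
            self.patterns.setdefault(block, {})[hist] = msg
        self.history[block] = (hist + (msg,))[-self.depth:]

p = TwoLevelPredictor(depth=1)
for m in [("P1", "read"), ("P2", "write"), ("P1", "read"), ("P2", "write")]:
    p.observe("blockA", m)
print(p.predict("blockA"))  # → ('P1', 'read')
```

With depth 1, the alternating read/write pattern is learned after one repetition: seeing P2's write, the predictor expects P1's read next.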
8. Pattern Based Message Predictors (contd.)
- Depth of the history register = number of past messages it keeps track of.
- Deeper history depth → more accurate prediction, no race conditions.
- Deeper history depth → large pattern table → high cost.
9. Memory Sharing Predictors
- Shortcomings of the general message predictor:
- Invalidation acknowledgement messages may arrive in any order, and thus may interfere with prediction of the more necessary request messages.
- It increases the number of pattern table entries (almost doubles them).
- It increases the number of bits needed to encode the messages (three requests + two acks).
- Observations:
- To eliminate the coherence overhead on remote access, it is only necessary to predict memory request messages (read, write, upgrade).
- Coherence acknowledgement message prediction is extra overhead, as acks are always expected to arrive in response to a coherence action.
10. Memory Sharing Predictors
- MSP addresses these issues by predicting only the memory request messages.
- Since the acknowledgements are eliminated, all effects of possible reordering of acknowledgements are eliminated.
- Only 2 bits are required to encode messages, compared to 3 for the general predictor.
11. VMSP: A Vector MSP
- Observations:
- A full-map protocol allows multiple processors to simultaneously cache read-only copies of a memory block.
- A predictor must identify the sharers, not maintain the order in which they read.
- Optimization of MSP into VMSP:
- Rather than record and predict read requests as individual pattern table entries, encode a sequence of read requests as a bit vector, just as the directory maintains its list of sharers.
12. Vector Memory Sharing Predictor (contd.)
- Benefits:
- Reduces the number of pattern table entries
- Eliminates the effect of reordering of reads on table size
- Effect of history depth and number of sharers:
- Good when the number of readers is large (> (2n)/(2 log n), i.e., roughly n/log n for n nodes).
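The vector encoding can be illustrated with a small sketch; the node count and ids below are hypothetical:

```python
# Sketch: fold a run of read requests into a sharer bit vector (as the
# directory does), so reader arrival order no longer matters.
N_NODES = 4  # illustrative system size

def reads_to_vector(readers):
    """Encode any ordering of read requests as one bit vector."""
    vec = 0
    for node in readers:
        vec |= 1 << node  # set the bit for this sharer
    return vec

# Two different arrival orders collapse to the same pattern-table entry:
assert reads_to_vector([0, 2, 3]) == reads_to_vector([3, 0, 2]) == 0b1101
```

This is why VMSP needs only one pattern-table entry per read-sharing episode, where MSP would need one entry per reader and per observed ordering.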
13. Triggering Request Speculation
- Important considerations:
- Predict what remote memory requests arrive
- Predict when remote accesses arrive
- Execute the necessary coherence actions
(Figure: a speculative coherent DSM node and coherence hardware)
14. Triggering Request Speculation
- A) What remote memory request arrives: relatively simple, read from the pattern table (which stores what memory accesses take place).
- B) When it arrives: harder.
- Early speculation may take a block away from its readers.
- Late speculation may incur additional delay and may limit the DSM's ability to hide coherence overhead.
- This was not a problem in Cosmos, since all coherence messages were predicted but not sent; they were sent only after the previous message arrived. Since there are no coherence acknowledgement messages in MSP's history table, timing is now a problem.
15. Triggering Request Speculation
- Two ways to overcome this:
- 1) Speculative Write Invalidation (SWI)
- Based on the most common memory access pattern, the producer/consumer scenario: a producer writes to a memory block and then no longer accesses it until it has been read by consumers. Common in parallel commercial database servers.
- MSP predicts that a processor is done writing when the processor writes to some other memory location.
- Maintain an early write-invalidate (EWI) table that stores the last address written by each processor.
- If an address in the EWI table changes, trigger a speculative write invalidate and the subsequent reads.
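The EWI heuristic can be sketched as follows; the table layout and the callback are illustrative assumptions, not the paper's hardware interface:

```python
# Sketch of the early write-invalidate (EWI) heuristic: when a
# processor's last-written address changes, assume it is done with the
# old block and trigger a speculative write invalidation.
ewi_table = {}  # processor id -> last block address written

def on_write(proc, block, trigger_swi):
    """Record the write; if the processor moved to a new block, speculate."""
    last = ewi_table.get(proc)
    ewi_table[proc] = block
    if last is not None and last != block:
        trigger_swi(last)  # speculatively invalidate the previous block

invalidated = []
on_write("P2", 0xA0, invalidated.append)
on_write("P2", 0xB0, invalidated.append)  # write to a different address
print(invalidated)  # prints [160], i.e. block 0xA0 was invalidated early
```

Note the heuristic fires only on a change of written address, matching the slide: repeated writes to the same block do not trigger speculation.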
16. Comparison with the general message predictor
(Figure: timing diagrams for P1 = reader, P2 = writer, P3 = directory, showing write A, read, invalidate, write B, writeback, and send block; with speculation, prefetching starts at the early write invalidate, so P1's read becomes a read hit)
17. Question?
- What happens if, while the speculated read data is being sent by P3 to P1, P1 has already made the request for the data?
18. Question?
- What happens if, while the speculated read data is being sent by P3 to P1, P1 has already made the request for the data?
- The DSM node, on receiving the speculated message, drops it, to avoid modifying the protocol.
19. Question?
- What happens if P1 makes the read request before P2 does the second write?
20. Question?
- What happens if P1 makes the read request before P2 does the second write?
- 2) First Read
- If SWI fails, then on the first read request made, all subsequent reads are triggered.
21. Speculative Coherence Operations
- Final action:
- Execute a coherence action speculatively
- Verify the accuracy of the predictor
- Requirements:
- Co-exist with the base coherence protocol without any protocol modifications.
- MSP simply advises the protocol to execute coherence operations. Any misspeculation results in additional coherence operations but no interference with protocol functionality.
- E.g., a premature write invalidation results in an additional read/write request by the producer.
- MSP advises the protocol to send read-only block copies to the predicted requesters.
22. Verification of accuracy
- A reference bit is kept in the remote cache for every block placed speculatively.
- On an actual reference, the remote cache clears the bit, verifying that the access occurred.
- On invalidation of the block, the reference bit is sent along with the invalidation message.
- The MSP at the home node examines this bit and removes mispredicted messages.
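A rough software model of the reference-bit scheme, with illustrative names (the real mechanism is a single bit per cache line):

```python
# Sketch: the remote cache sets a reference bit when a block arrives
# speculatively and clears it on a real access; the bit rides back on
# the invalidation so the home MSP can spot useless speculations.
class RemoteCacheLine:
    def __init__(self, speculative):
        self.ref_bit = speculative  # set iff the block arrived speculatively

    def access(self):
        self.ref_bit = False  # a real reference confirms the speculation

    def invalidate(self):
        return self.ref_bit  # True -> speculative send was never used

used = RemoteCacheLine(speculative=True)
used.access()
print(used.invalidate())  # → False: the speculation was verified as useful
```

A line that is invalidated with the bit still set tells the home node the corresponding prediction was wrong, so it can be removed from the pattern table.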
23. Performance Analysis
- Performance depends on:
- Speculation accuracy
- Reduction in latency on successful speculation
- Misspeculation penalty
- Speculation opportunity: a computationally intensive application will benefit little from speculation.
- Assumptions:
- When a speculative memory request is successfully executed, the entire remote latency is hidden.
- Misspeculation only slows the remote access; it does not increase the request frequency.
24. Performance
- Performance model:
- c = application's communication ratio
- f = fraction of speculatively executed requests over all received requests
- p = request prediction accuracy
- l_access = local access latency
- r_access = remote access latency
- rtl = r_access / l_access
- n = misspeculation penalty factor
- N = number of remote requests on the critical path
25. Performance
- Communication speedup = (comm. time w/o speculation) / (comm. time w/ speculation)
  = N·r_access / [ (1-f)·N·r_access + f·N·(p·l_access + (1-p)·n·r_access) ]
  = 1 / [ (1-f) + f·(p/rtl + n·(1-p)) ]
- Total speedup = (total execution time w/o speculation) / (total execution time w/ speculation)
  = 1 / [ (1-c) + c/(comm_speedup) ]
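The model is easy to evaluate numerically; the parameter values below are illustrative, not measurements from the paper:

```python
# Worked example of the slide's analytic speedup model.
def comm_speedup(f, p, rtl, n):
    """Communication speedup: 1 / ((1-f) + f*(p/rtl + n*(1-p)))."""
    return 1.0 / ((1 - f) + f * (p / rtl + n * (1 - p)))

def total_speedup(c, f, p, rtl, n):
    """Overall speedup (Amdahl-style): 1 / ((1-c) + c/comm_speedup)."""
    return 1.0 / ((1 - c) + c / comm_speedup(f, p, rtl, n))

# e.g. 40% communication ratio, 80% of requests speculated, 90% accuracy,
# remote/local latency ratio 10, misspeculation penalty factor 1.5:
s = total_speedup(c=0.4, f=0.8, p=0.9, rtl=10.0, n=1.5)
print(round(s, 2))  # → 1.32
```

Lowering p to, say, 0.3 in this formula makes the bracketed term exceed 1 and the model predicts a slowdown, which matches the discussion of low prediction accuracies on the next slides.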
26. Speedup vs. various parameters
(Figure: potential speedup in a speculative coherent DSM)
27. Speedups
- Prediction accuracy plays a prominent role in speedup.
- A low prediction accuracy (10-50%) results in slowdown due to high speculation overhead, while a high prediction accuracy (90%) increases speedup even for moderate communication ratios.
- At high prediction rates, the slowdown due to increasing misspeculation penalty is not significant.
- f, the fraction of speculated requests, is a measure of the number of request messages it takes to learn and predict. For rapidly changing patterns, even at high prediction accuracy, the performance improvement will not be significant.
- A speculative coherent protocol impacts clusters most, because of their high rtl ratio.
28. Simulation results
- Wisconsin Wind Tunnel II, simulating a CC-NUMA with 16 nodes interconnected through hardware DSM boards over a low-latency switched network.
- Full-map write-invalidate protocol with 32-byte coherence blocks.
- Benchmarks: appbt, barnes, em3d, moldyn, ocean, tomcatv, unstructured.
29. Results
(Figure: base predictor accuracy comparison, history depth = 1)
30. Results
- Em3d and moldyn exhibit producer/consumer sharing with small read sharing → low impact of read ordering → high performance with MSP.
- Unstructured exhibits wide read-sharing in its producer/consumer phase, hence MSP gets a prediction accuracy of less than 65% while VMSP gets almost 85%.
31. Results
(Figure: prediction accuracy with varying history depths)
32. Results
(Figure: messages predicted (correctly predicted) for a history depth of 1)
33. Results
(Figure: predictor storage overhead)
34. Results
- All predictors use 4 bits to encode a processor id.
- Cosmos uses 3 bits to encode the message type → 7 bits per history table entry and 14 bits per pattern table entry (pte) → (7+14) bits per block.
- MSP and VMSP use 2 bits to encode a message type.
- MSP: 12 bits per pte → (6+12) bits per block.
- VMSP uses 18 bits per history table entry, but (18+6) bits per pte → (18+24) bits per block (in VMSP a read vector is always followed by a write/upgrade and vice versa, so a pte contains at most one entry).
- MSP and VMSP require less storage compared to Cosmos.
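The per-block arithmetic can be checked directly; the breakdown below is one reading of the slide's numbers (16 nodes, hence the 16-bit sharer vector in VMSP):

```python
# Checking the slide's per-block predictor storage arithmetic.
PID_BITS = 4  # 16 nodes -> 4-bit processor id

cosmos_hist = PID_BITS + 3          # 7-bit history entry (id + 3-bit msg type)
cosmos_pte  = 2 * cosmos_hist       # 14-bit pte: observed pattern + prediction
msp_hist    = PID_BITS + 2          # 6-bit history entry (id + 2-bit msg type)
msp_pte     = 2 * msp_hist          # 12-bit pte
vmsp_hist   = 16 + 2                # 18 bits: sharer vector + msg type
vmsp_pte    = vmsp_hist + msp_hist  # 18 + 6 = 24-bit pte

print(cosmos_hist + cosmos_pte)  # 21 bits per block for Cosmos
print(msp_hist + msp_pte)        # 18 bits per block for MSP
print(vmsp_hist + vmsp_pte)      # 42 bits per block for VMSP
```

Even though VMSP's entries are wider, its pattern table holds at most one entry per block, so total storage still comes out below Cosmos in the slide's accounting.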
35. Summary and Conclusion
- Proposed the Memory Sharing Predictor to predict and execute coherence operations speculatively.
- MSP eliminates acknowledgement messages in pattern tables, increasing prediction accuracy from 81% to 86%.
- VMSP further improves accuracy, up to 93%, using compact vector representations and eliminating perturbations due to read request reorderings.
- VMSP also reduces implementation storage.
- High accuracy predictors are key to a high performance speculative coherent DSM.