Title: Temporally Silent Stores (Alternatively: Louder Silent Stores)
1 Temporally Silent Stores (Alternatively: Louder Silent Stores)
- Kevin M. Lepak
- Mikko H. Lipasti
- University of Wisconsin-Madison
- PHARM Team
http://www.ece.wisc.edu/pharm
2 Introduction
- Many stores do not update system state
- Silent stores are writes which do not change the value at a memory location
- Our own prior work has shown (some might say ad nauseam) that silent stores are exploitable
- Uniprocessors: memory write port reductions, reducing write-throughs, etc.
- Multiprocessors: reducing sharing misses and invalidate traffic
- Other researchers have found silent stores useful
- Purser et al. MICRO-2000, Yoaz et al. HPCA-2001, Steffan et al. HPCA-2002, others
3 What Is Temporal Silence?
- The key idea behind silence is no observable change in system state
- What if we change the state but then change it back?
- Examples
- Adding/removing items from a shared work stack
- Flags indicating condition of a device/data structure
- Lock variables (revert to unheld value when released)
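The lock example can be made concrete with a small sketch. The function below is illustrative only and is not taken from the talk: `classify_stores` and its labels are hypothetical names, and it remembers just one previous value per address, whereas the talk speaks of reverting to "some previous version".

```python
from typing import Dict, List, Tuple

def classify_stores(trace: List[Tuple[int, int]],
                    initial: Dict[int, int]) -> List[str]:
    """Label each (addr, value) store as silent, temporally_silent, or non_silent.

    Assumption: only the single most recent previous value is tracked.
    """
    current = dict(initial)        # value now held at each address
    previous: Dict[int, int] = {}  # value held before the last non-silent store
    labels = []
    for addr, value in trace:
        cur = current.get(addr, 0)
        if value == cur:
            # Writes the value already there: an ordinary silent store.
            labels.append("silent")
            continue
        if addr in previous and value == previous[addr]:
            # Reverts the location to the value it held earlier.
            labels.append("temporally_silent")
        else:
            labels.append("non_silent")
        previous[addr] = cur
        current[addr] = value
    return labels

# A lock word at a hypothetical address: acquire (1), release (0), redundant clear (0).
LOCK = 0x100
labels = classify_stores([(LOCK, 1), (LOCK, 0), (LOCK, 0)], {LOCK: 0})
```

Here the acquire is a non-silent store, the release is the temporally silent (reversion) store, and the redundant clear is an ordinary silent store.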
4 Temporal Silence (TS) In MPs
[Figure: Addr A holds 0, an intermediate value store writes 1, and a reversion (TS) store writes 0 again; the two stores form a temporally silent pair.]
Question: Does CPU 1 need to re-read Addr A?
Answer: No, old value is correct.
Can we exploit TS to eliminate this read miss?
5 Outline
- Introduction to Temporal Silence
- Redefining Multiprocessor Sharing
- Multiprocessor Limit Study
- Characterizing Temporal Silence
- Exploiting Temporal Silence Non-Speculatively with Coherence Support
- Conclusions/Future Work
6 TSS MP Limit Study Setup
- Temporal Silent Sharing (TSS)
- How often is a given CPU's last fetched copy of a cache line current w.r.t. the global copy when accessed?
- Indicates the potential reduction in data traffic by exploiting TS for shared cache lines
- Infinite/finite caches, unified, 64B lines
- Instant TS detection/propagation to remote processors
- PowerPC, AIX v4.3.1
- Scientific (SPLASH-2) and commercial workloads
- SimOS-PPC full system simulator (4 CPUs)
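The question the limit study asks can be sketched as a trace-driven count. This is a toy model under stated assumptions, not the SimOS-PPC setup: caches are infinite, a cache line is reduced to a single value, only load misses are counted, and `count_comm_misses` and the trace format are hypothetical.

```python
def count_comm_misses(trace, num_cpus=4):
    """Return (baseline, tss) communication load misses for a trace.

    trace items are (cpu, op, line, value) with op in {"load", "store"};
    value is ignored for loads. Assumes infinite caches and one value per line.
    """
    memory = {}                                 # global copy of each line
    copies = [dict() for _ in range(num_cpus)]  # each CPU's last fetched copy
    stale = [set() for _ in range(num_cpus)]    # lines invalidated by a remote store
    baseline = tss = 0
    for cpu, op, line, value in trace:
        if op == "store":
            memory[line] = value
            for other in range(num_cpus):
                if other != cpu and line in copies[other]:
                    stale[other].add(line)      # baseline: every store invalidates
            copies[cpu][line] = value
            stale[cpu].discard(line)
        else:
            if line in stale[cpu]:
                baseline += 1                   # ordinary protocol must re-fetch
                if copies[cpu][line] != memory[line]:
                    tss += 1                    # TSS misses only if the copy is not current
                stale[cpu].discard(line)
            copies[cpu][line] = memory.get(line, 0)
    return baseline, tss

A = 0xA
trace = [
    (1, "load", A, None),   # CPU 1 fetches the line (cold miss, not counted)
    (0, "store", A, 1),     # intermediate value store
    (0, "store", A, 0),     # reversion (TS) store
    (1, "load", A, None),   # baseline miss; avoided under TSS
    (0, "store", A, 1),
    (1, "load", A, None),   # miss in both models
]
baseline, tss = count_comm_misses(trace)
```

The gap between the two counts is the potential reduction the study measures; instant detection and propagation are implied by comparing against the global copy directly.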
7 MP Limit Study--Comm. Misses
Up to 45% reduction in communication misses for TSS, 24%/42% harmonic mean for scientific/commercial
8 MP Limit Study--Overall Miss Rate
Up to 33% reduction in miss rate for infinite cache TSS, 15%/25% harmonic mean for scientific/commercial
9 Understanding TS/TSS
- Examine the contribution to TSS by kernel, library, and user functions
- Other benchmarks shown in paper
- Scientific--substantial activity within kernel
- Commercial--locking, JRE, process management
[Table: Example for Spec-JBB; columns: TSS Misses, TSS Stores, Function, Description]
10 Understanding TSS
- Values of intermediate and TS stores
- Most TS store values are integer zero
- Greater than 5% in TPC-W and Spec-WEB are non-null pointers
- In many cases, intermediate value is not one (user-level spin-locks)
- Thread IDs--large fraction in commercial apps
- Even in OCEAN--not user-level spin locks (40%)
- Pointers to shared data structures (up to 40% in Spec-WEB)
- Contribution by atomic primitives (lines touched with store-conditionals)
11 TSS--Atomic Primitives
Many true sharing misses are due to atomic primitives. Exception: Spec-JBB, 55% of TSM are data
12 Attacking TSS Non-Speculatively
- Idea
- Memory values revert to previous values
- Detect when they revert to some previous version
- Communicate reversion/version to other CPUs
- How can we win?
- Improve remote read latency (cache misses -> hits)
- Communicate versions only (not values)
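One way to picture "communicate versions only" is a writer-side table of the values a line has held, each tagged with a small version number. The sketch below is an assumption made for illustration; the talk does not specify this structure, and `VersionedLine`, its message tuples, and the `history` limit are hypothetical.

```python
class VersionedLine:
    """Writer-side record of values a line has held, keyed by version number."""

    def __init__(self, value, history=4):
        self.history = history       # hypothetical limit on remembered versions
        self.versions = {0: value}   # version id -> value that was visible
        self.current = 0
        self.next_id = 1

    def store(self, value):
        """Return the message a store would generate: (kind, version)."""
        if self.versions[self.current] == value:
            return ("silent", self.current)        # nothing changes
        for vid, old in self.versions.items():
            if old == value:
                self.current = vid
                return ("validate", vid)           # reverted to a known version
        vid = self.next_id
        self.next_id += 1
        self.versions[vid] = value
        if len(self.versions) > self.history:
            # Forget the oldest version other than the one just created.
            del self.versions[min(k for k in self.versions if k != vid)]
        self.current = vid
        return ("invalidate", vid)

line = VersionedLine(0)
msgs = [line.store(1), line.store(0), line.store(0)]
```

Only the version number travels with the validate message, so a remote CPU that still holds the data for that version can reuse it without a data transfer.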
13 Detecting Temporal Silence
- Inside the core
- Augment LSQ to detect TS
- Augment write buffer/write cache
- Outside the core
- Exploit inclusive memory hierarchies
- Ex: Modified L1 cache line has old version in L2
- Augment L1/L2 with explicit storage
This talk assumes we have enough storage to detect all cases of TSS which we can exploit
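The inclusive-hierarchy example can be sketched as a comparison between the dirty L1 line and the older copy still held in L2. This is a simplified illustration that assumes a write-back L1 whose stores do not update L2; `InclusiveHierarchy` and its methods are hypothetical names.

```python
class InclusiveHierarchy:
    """Toy L1/L2 pair: L2 keeps the old version while the L1 line is Modified."""

    def __init__(self):
        self.l2 = {}   # line address -> last globally visible words (old version)
        self.l1 = {}   # line address -> current, possibly dirty, words

    def fill(self, addr, words):
        """Bring a line into both levels (inclusive hierarchy)."""
        self.l2[addr] = tuple(words)
        self.l1[addr] = list(words)

    def store(self, addr, offset, value):
        """Write one word into L1; return True if the line just became temporally silent."""
        was_dirty = tuple(self.l1[addr]) != self.l2[addr]
        self.l1[addr][offset] = value
        now_matches = tuple(self.l1[addr]) == self.l2[addr]
        # Temporal silence: the line differed from the old version and now equals it again.
        return was_dirty and now_matches

h = InclusiveHierarchy()
h.fill(0x40, [0, 7])
results = [h.store(0x40, 0, 1), h.store(0x40, 0, 0), h.store(0x40, 0, 0)]
```

The third store is an ordinary silent store, so it is not reported as a new reversion. A real design would compare whole lines in hardware and would need the explicit storage mentioned above when L2 no longer holds the old version.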
14 Exploiting Temporal Silence
- Add coherence support
- Add a temporally invalid (T) state to MESI
- Entered upon receipt of an invalidate
- Add a validate transaction
- Takes remote lines T->S
- Broadcast when a processor detects the occurrence of temporal silence
- May lead to an increase in address traffic
- New protocol -> MESTI
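The two additions named above (the T state and the validate transaction) can be sketched as transitions on a remote copy. Only the transitions stated on the slide are modeled; treating M, E, and S alike on an invalidate and leaving every other combination unchanged are assumptions of this sketch, not the full MESTI protocol.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    T = "Temporally invalid"
    I = "Invalid"

def on_snoop(state: State, event: str) -> State:
    """Next state of a remote copy for a snooped 'invalidate' or 'validate'."""
    if event == "invalidate":
        # A valid copy keeps its (now stale) data and enters T instead of I.
        return State.T if state in (State.M, State.E, State.S) else state
    if event == "validate":
        # The writer detected temporal silence: the saved data is current again.
        return State.S if state is State.T else state
    raise ValueError(f"unknown event: {event}")

def load_hits(state: State) -> bool:
    """A load hits only in a valid state; T behaves like I until validated."""
    return state in (State.M, State.E, State.S)

s = State.S
s = on_snoop(s, "invalidate")   # intermediate value store elsewhere
after_invalidate = s
s = on_snoop(s, "validate")     # reversion (TS) store elsewhere
```

After the validate the copy is back in S, so the next load hits instead of taking a communication miss.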
15 MESTI Protocol: Comm. Misses
MESTI exploits most TSS; reduction in comm. misses 21%/40% harmonic mean for scientific/commercial
16 MESTI Address Traffic--Ideal
No address traffic increase with oracle predictor of useful validates -> validates prevent remote read misses
17 MESTI Address Traffic--Measured
- Actual increase up to 108% (infinite cache)
- Decreases as cache size decreases
- What causes useless validates?
- No remote CPU has a copy of the cache line
- Exploit snoop-aware validate
- Detect if any remote CPU has a copy at intermediate value store--if not, avoid validate
- Reduces useless validates by 0-20% for infinite caches, 7-50% for a finite 16MB cache
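The snoop-aware validate can be sketched as a one-bit-per-line filter. This is an illustration that assumes the snoop response to the intermediate value store reports whether any remote CPU held the line; `ValidateFilter` and its method names are hypothetical.

```python
class ValidateFilter:
    """Remember, per line, whether a remote copy existed at the intermediate value store."""

    def __init__(self):
        self.remote_had_copy = {}   # line -> snoop response at the intermediate store

    def on_intermediate_store(self, line, snoop_shared: bool):
        """Record the snoop response seen when the line was first overwritten."""
        self.remote_had_copy[line] = snoop_shared

    def should_validate(self, line) -> bool:
        """Called when temporal silence is detected; True means broadcast a validate."""
        # With no remote copy to revive, the validate would be useless traffic.
        return self.remote_had_copy.pop(line, False)

f = ValidateFilter()
f.on_intermediate_store(0xA, snoop_shared=True)    # some remote CPU held line 0xA
f.on_intermediate_store(0xB, snoop_shared=False)   # nobody held line 0xB
decisions = (f.should_validate(0xA), f.should_validate(0xB))
```

Only line 0xA produces a validate; the reversion of line 0xB stays local.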
18 MESTI Address Traffic--Reduction
- What causes useless validates?
- A remote access does not occur before the line is re-written (the TS write is not the last write)
- Place outbound validates into a delay queue
- If a subsequent non-silent store occurs to the cache line, the validate is aborted
- Filters many useless validates for lines re-written quickly
- Affects timeliness of validates
- Some TSS misses will not be avoided if cache line has returned to old value but validate is delayed
How do we determine effectiveness of such a queue?
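The delay queue described on this slide can be sketched as follows. It is a minimal model under assumptions: time is an integer cycle count supplied by the caller, and `ValidateDelayQueue`, its method names, and the delay value in the usage lines are hypothetical.

```python
from collections import OrderedDict

class ValidateDelayQueue:
    """Hold outbound validates for a fixed delay; abort them if the line is re-written."""

    def __init__(self, delay_cycles: int):
        self.delay = delay_cycles
        self.pending = OrderedDict()   # line -> cycle at which the validate may be sent

    def on_temporal_silence(self, line, now: int):
        """Queue a validate for a line that just reverted to its old value."""
        self.pending[line] = now + self.delay
        self.pending.move_to_end(line)

    def on_non_silent_store(self, line):
        """The line changed again before the validate left: abort it."""
        self.pending.pop(line, None)

    def tick(self, now: int):
        """Return the lines whose validates are released at cycle `now`."""
        ready = [line for line, due in self.pending.items() if due <= now]
        for line in ready:
            del self.pending[line]
        return ready

q = ValidateDelayQueue(delay_cycles=10)   # illustrative delay only
q.on_temporal_silence(0xA, now=0)
q.on_temporal_silence(0xB, now=2)
q.on_non_silent_store(0xA)                # 0xA re-written quickly: validate filtered
early = q.tick(9)
released = q.tick(12)
```

A longer delay filters more validates for lines that are re-written quickly, but it also postpones useful ones, which is the timeliness trade-off noted above.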
19 MESTI--Delay Queue Approach
Green: % of not-last-write TS stores detected in this distance
Red: % of TS last writes exploited if propagated within this distance
20 MESTI--Delay Queue Approach
Short queue (27 cycles) removes 30-35% of useless validates for Spec-JBB and TPC-W with 1% opportunity lost
21 Summary/Conclusions
- Storage locations revert to previous values
- We call stores writing such values temporally silent
- Redefine MP sharing to consider this
- Up to 45% of comm. misses eliminated
- Characterization reveals insight at function level, and not all due to atomic primitives
- Exploit non-speculatively with simple enhancements to the coherence protocol
- Achieves the vast majority of possible benefit
- Simple methods to reduce additional coherence txns
22 Current and Future Work
- How do we approach the limit study results with realistic implementations?
- Further limit unnecessary address traffic
- Detail efficient ways of detecting TS in the memory hierarchy
- Performance evaluation in OoO processor models, commercial workloads
- Comparison with speculative methods which can capture TS