Temporally Silent Stores (Alternatively: Louder Silent Stores)

1
Temporally Silent Stores (Alternatively: Louder
Silent Stores)
  • Kevin M. Lepak
  • Mikko H. Lipasti
  • University of Wisconsin-Madison
  • PHARM Team

http://www.ece.wisc.edu/pharm
2
Introduction
  • Many stores do not update system state
  • Silent stores are writes that do not change
    the value at a memory location
  • Our own prior work has shown (some might say
    ad nauseam) that silent stores are exploitable
  • Uniprocessors: memory write port reductions,
    reducing write-throughs, etc.
  • Multiprocessors: reducing sharing misses and
    invalidate traffic
  • Other researchers have found silent stores useful
  • Purser et al. (MICRO-2000), Yoaz et al.
    (HPCA-2001), Steffan et al. (HPCA-2002), others
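
To make the definition concrete, here is a minimal sketch of silent-store detection. It assumes memory is modeled as a plain dictionary; the `store` helper and the address `0x10` are illustrative names, not part of the talk.

```python
def store(memory, addr, value):
    """Write value to memory[addr]; return True if the store was silent."""
    # A store is silent when it writes the value already held at addr.
    silent = memory.get(addr) == value
    if not silent:
        memory[addr] = value  # only non-silent stores change system state
    return silent


memory = {0x10: 0}
results = [
    store(memory, 0x10, 0),  # writes 0 over 0: silent
    store(memory, 0x10, 1),  # writes 1 over 0: not silent
    store(memory, 0x10, 1),  # writes 1 over 1: silent
]
```

In hardware the comparison needs the old value to be read before the write, which is the cost the silent-store work trades against the saved write traffic.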

3
What Is Temporal Silence?
  • The key idea behind silence is no observable
    change in system state
  • What if we change the state but then change it
    back?
  • Examples
  • Adding/removing items from a shared work stack
  • Flags indicating condition of a device/data
    structure
  • Lock variables (revert to unheld value when
    released)

4
Temporal Silence (TS) In MPs
[Figure: values shown for Addr A are 0, 1, 0. Labels: Intermediate Value Store, Reversion (TS) Store, Temporally Silent Pair, X.]
Question: Does CPU 1 need to re-read Addr A?
Answer: No, the old value is correct.
Can we exploit TS to eliminate this read miss?
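
The store pair in the figure can be labeled mechanically. The sketch below assumes a single memory word and a hypothetical `classify_stores` helper: `visible` is the value remote CPUs last fetched, and a store that brings the word back to that value is the reversion (TS) store.

```python
def classify_stores(initial, values):
    """Label each store relative to the value remote CPUs last saw."""
    visible = initial  # value held in remote (now stale) copies
    current = initial  # value currently in memory
    labels = []
    for v in values:
        if v == current:
            labels.append("silent")          # no change at all
        elif v == visible:
            labels.append("reversion (TS)")  # changes back to the old value
        else:
            labels.append("intermediate")    # changes away from the old value
        current = v
    return labels


# A lock word: acquired (1), then released back to unheld (0).
lock_labels = classify_stores(0, [1, 0])
# A longer sequence that still ends at the original value.
longer_labels = classify_stores(0, [1, 1, 2, 0])
```

Whenever the last label is a reversion, a remote copy holding the initial value is correct again, which is the case the question on this slide asks about.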
5
Outline
  • Introduction to Temporal Silence
  • Redefining Multiprocessor Sharing
  • Multiprocessor Limit Study
  • Characterizing Temporal Silence
  • Exploiting Temporal Silence Non-Speculatively
    with Coherence Support
  • Conclusions/Future Work

6
TSS MP Limit Study Setup
  • Temporal Silent Sharing (TSS)
  • How often is a given CPU's last-fetched copy of a
    cache line current w.r.t. the global copy when
    accessed?
  • Indicates the potential reduction in data traffic
    by exploiting TS for shared cache lines
  • Infinite/finite caches, unified, 64B lines
  • Instant TS detection/propagation to remote
    processors
  • PowerPC, AIX v4.3.1
  • Scientific (SPLASH-2) and commercial workloads
  • SimOS-PPC full system simulator (4 CPUs)
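
The limit-study question can be phrased as a small trace experiment. This is a toy sketch only, not the SimOS-PPC setup: it assumes infinite caches, one word per line, memory initialized to 0, and a made-up trace format `(cpu, op, addr, value)`. With `tss=True`, an invalidated copy still counts as a hit if its value equals the global value at access time (instant TS detection and propagation).

```python
def count_comm_misses(trace, tss=False):
    """Count communication misses; cold misses are not counted."""
    memory = {}  # addr -> global value (absent means 0)
    copies = {}  # (cpu, addr) -> (value fetched, valid flag)
    misses = 0
    for cpu, op, addr, value in trace:
        if op == "st":
            memory[addr] = value
            # Conventional invalidate of every other CPU's copy.
            for (c, a), (v, _) in list(copies.items()):
                if a == addr and c != cpu:
                    copies[(c, a)] = (v, False)
            copies[(cpu, addr)] = (value, True)
            continue
        current = memory.get(addr, 0)
        entry = copies.get((cpu, addr))
        if entry is None:
            copies[(cpu, addr)] = (current, True)  # cold miss
            continue
        fetched, valid = entry
        if valid or (tss and fetched == current):
            copies[(cpu, addr)] = (fetched, True)  # hit
        else:
            misses += 1                            # communication miss
            copies[(cpu, addr)] = (current, True)
    return misses


# CPU 1 reads A, CPU 0 writes 1 then 0, CPU 1 reads A again.
trace = [(1, "ld", "A", None), (0, "st", "A", 1),
         (0, "st", "A", 0), (1, "ld", "A", None)]
baseline = count_comm_misses(trace)
with_tss = count_comm_misses(trace, tss=True)
```

On this trace the baseline takes one communication miss and the TSS model takes none, because CPU 1's stale copy already holds the reverted value.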

7
MP Limit Study--Comm. Misses
Up to 45% reduction in communication misses for
TSS, 24%/42% harmonic mean for scientific/commercial
8
MP Limit Study--Overall Miss Rate
Up to 33% reduction in miss rate for infinite
cache TSS, 15%/25% harmonic mean for
scientific/commercial
9
Understanding TS/TSS
  • Examine the contribution to TSS by kernel,
    library, and user functions
  • Other benchmarks shown in paper
  • Scientific--substantial activity within kernel
  • Commercial--locking, JRE, process management

[Table, example Spec-JBB. Columns: TSS Misses, TSS Stores, Function, Description]
10
Understanding TSS
  • Values of intermediate and TS stores
  • Most TS store values are integer zero
  • More than 5% in TPC-W and Spec-WEB are
    non-null pointers
  • In many cases, intermediate value is not one
    (user-level spin-locks)
  • Thread IDs: large fraction in commercial apps
  • Even in OCEAN: not user-level spin locks (40%)
  • Pointers to shared data structures (up to 40% in
    Spec-WEB)
  • Contribution by atomic primitives (lines touched
    with store-conditionals)

11
TSS--Atomic Primitives
Many true sharing misses are due to atomic
primitives. Exception: Spec-JBB, where 55% of TSM are
data
12
Attacking TSS Non-Speculatively
  • Idea
  • Memory values revert to previous values
  • Detect when they revert to some previous version
  • Communicate reversion/version to other CPUs
  • How can we win?
  • Improve remote read latency (cache misses ->
    hits)
  • Communicate versions only (not values)

13
Detecting Temporal Silence
  • Inside the core
  • Augment LSQ to detect TS
  • Augment write buffer/write cache
  • Outside the core
  • Exploit inclusive memory hierarchies
  • Ex.: a modified L1 cache line has its old version in L2
  • Augment L1/L2 with explicit storage

This talk assumes we have enough storage
to detect all cases of TSS which we can exploit
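
One way to picture the "explicit storage" option is a per-line snapshot. The sketch below is an assumption-laden illustration, not the hardware described in the talk: the `TSDetector` class, its methods, and the four-word line are all hypothetical. It saves the old version of a line before the first change and reports temporal silence once the whole line matches that version again.

```python
class TSDetector:
    """Keeps one saved old version per dirty line (explicit storage)."""

    def __init__(self):
        self.lines = {}  # line address -> current words (list)
        self.saved = {}  # line address -> old version (tuple), while dirty

    def fill(self, line_addr, words):
        self.lines[line_addr] = list(words)
        self.saved.pop(line_addr, None)

    def store(self, line_addr, offset, value):
        line = self.lines[line_addr]
        if line[offset] == value:
            return "silent"  # the word already holds this value
        if line_addr not in self.saved:
            self.saved[line_addr] = tuple(line)  # snapshot before first change
        line[offset] = value
        if tuple(line) == self.saved[line_addr]:
            del self.saved[line_addr]  # whole line is back to the old version
            return "temporally silent"
        return "non-silent"


detector = TSDetector()
detector.fill(0x40, [0, 0, 7, 0])
events = [
    detector.store(0x40, 0, 1),  # word 0 leaves its old value
    detector.store(0x40, 2, 9),  # word 2 leaves its old value
    detector.store(0x40, 0, 0),  # word 0 reverts, word 2 still differs
    detector.store(0x40, 2, 7),  # word 2 reverts: line matches old version
]
```

The example shows why detection is done per line rather than per word: only the last store makes the full line equal to the version remote CPUs may still hold.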
14
Exploiting Temporal Silence
  • Add coherence support
  • Add a temporally invalid (T) state to MESI
  • Entered upon receipt of an invalidate
  • Add a validate transaction
  • Takes remote lines T -> S
  • Broadcast when a processor detects the occurrence
    of temporal silence
  • May lead to an increase in address traffic
  • New protocol -> MESTI
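
The two additions named above (the T state and the validate transaction) can be sketched as a transition table for a remote sharer's copy. Only S -> T on an invalidate and T -> S on a validate come from the slide; the remaining entries, the message names, and the decision to leave M and E unmodeled are assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    M = "modified"
    E = "exclusive"
    S = "shared"
    T = "temporally invalid"
    I = "invalid"


# Snoop-side transitions for a remote copy.
REMOTE_TRANSITIONS = {
    (State.S, "invalidate"): State.T,  # from the slide: keep data, mark T
    (State.T, "validate"): State.S,    # from the slide: old data is current
    (State.T, "invalidate"): State.T,  # assumption: stays temporally invalid
    (State.I, "invalidate"): State.I,  # assumption: nothing to keep
    (State.I, "validate"): State.I,    # assumption: no old data to revive
}


def snoop(state, message):
    """Return the next state of a remote copy after a snooped message."""
    try:
        return REMOTE_TRANSITIONS[(state, message)]
    except KeyError:
        raise NotImplementedError(f"{state.name} on {message} is not modeled")


def read_hits(state):
    """A local read hits only in a valid state; T misses like I."""
    return state in (State.M, State.E, State.S)


state = State.S
history = [state]
for message in ("invalidate", "validate"):
    state = snoop(state, message)
    history.append(state)
```

After the validate the copy is readable again without any data transfer, which is how the protocol turns a would-be read miss into a hit.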

15
MESTI Protocol--Comm. Misses
MESTI exploits most TSS: reduction in comm.
misses of 21%/40% harmonic mean for
scientific/commercial
16
MESTI Address Traffic--Ideal
No address traffic increase with oracle predictor
of useful validates -> validates prevent remote
read misses
17
MESTI Address Traffic--Measured
  • Actual increase up to 108% (infinite cache)
  • Decreases as cache size decreases
  • What causes useless validates?
  • No remote CPU has a copy of the cache line
  • Exploit snoop-aware validate
  • Detect whether any remote CPU has a copy at the
    intermediate value store; if not, avoid the validate
  • Reduces useless validates by 0-20% for infinite
    caches, 7-50% for a finite 16MB cache

18
MESTI Address Traffic--Reduction
  • What causes useless validates?
  • A remote access does not occur before the line is
    re-written (the TS write is not the last write)
  • Place outbound validates into a delay queue
  • If a subsequent non-silent store occurs to the
    cache line, the validate is aborted
  • Filters many useless validates for lines
    re-written quickly
  • Affects timeliness of validates
  • Some TSS misses will not be avoided if cache line
    has returned to old value but validate is delayed
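
The delay-queue idea can be sketched as follows. The `ValidateDelayQueue` class, its method names, and the 4-cycle delay are hypothetical choices for illustration; the talk does not specify an implementation.

```python
class ValidateDelayQueue:
    """Holds outbound validates for a fixed number of cycles."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = {}  # line address -> cycle at which to send

    def on_ts_store(self, addr, now):
        # Temporal silence detected: schedule the validate, do not send yet.
        self.pending[addr] = now + self.delay

    def on_non_silent_store(self, addr):
        # The line was re-written: abort the queued validate, if any.
        return self.pending.pop(addr, None) is not None

    def tick(self, now):
        # Release every validate whose delay has expired.
        ready = sorted(a for a, t in self.pending.items() if t <= now)
        for addr in ready:
            del self.pending[addr]
        return ready


queue = ValidateDelayQueue(delay=4)
queue.on_ts_store("A", now=0)
queue.on_ts_store("B", now=1)
aborted = queue.on_non_silent_store("A")  # A is re-written at once
sent_at_4 = queue.tick(4)  # B is due at cycle 5, so nothing yet
sent_at_5 = queue.tick(5)  # B's validate is broadcast
```

A longer delay filters more useless validates but postpones the useful ones, which is the timeliness cost noted in the last two bullets.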

How do we determine effectiveness of such a queue?
19
MESTI--Delay Queue Approach
Green: % of not-last-write TS stores detected in
this distance
Red: % of TS last writes exploited if propagated
within this distance
20
MESTI--Delay Queue Approach
Short queue (27 cycles) removes 30-35% of useless
validates for Spec-JBB and TPC-W with 1%
opportunity lost
21
Summary/Conclusions
  • Storage locations revert to previous values
  • We call stores writing such values temporally
    silent
  • Redefine MP sharing to consider this
  • Up to 45% of comm. misses eliminated
  • Characterization reveals insight at function
    level, and not all due to atomic primitives
  • Exploit non-speculatively with simple
    enhancements to the coherence protocol
  • Achieves the vast majority of possible benefit
  • Simple methods to reduce additional coherence txns

22
Current and Future Work
  • How do we approach the limit study results with
    realistic implementations?
  • Further limit unnecessary address traffic
  • Detail efficient ways of detecting TS in the
    memory hierarchy
  • Performance evaluation in OoO processor models,
    commercial workloads
  • Comparison with speculative methods which can
    capture TS