Effective and Inexpensive (Memory) Race Recording

File metadata: Title: A Serializability Violation Detector for Shared-Memory Server Programs; Author: Min Xu; Last modified by: owen; Created Date: 4/22/2005 3:03:52 PM

1
Effective and Inexpensive (Memory) Race Recording
  • Min Xu
  • Thesis Defense
  • 05/04/2006
  • Electrical and Computer Engineering Department,
    UW-Madison
  • Advisors: Mark Hill, Rastislav Bodik
  • Committee: Remzi Arpaci-Dusseau, Mikko Lipasti,
    Barton Miller, David Wood

2
Overview
  • Increasingly useful to replay multithreaded code
  • Race recording is key to dealing with
    nondeterminism
  • A case study
    • Long recording: 1 byte/kilo-instr
    • Always-on recording: less than 2% overhead
    • Low cost: 24 KB RAM/core
    • Supports both SC & TSO (x86-like)

3
Thesis Contributions
  • Goal: an effective and inexpensive race recorder
  • RTR Algorithm → Small Log Size
  • Coherence Piggyback → Low Runtime Overhead
  • Set/LRU Approximation → Low Cost Hardware
  • Order-Value Hybrid → SC & TSO Applicability
4
Outline
  • Motivation & Problem (5 slides)
  • An Effective and Inexpensive Race Recorder (21 slides)
    • RTR Algorithm
    • Set/LRU Approximation
    • Coherence Piggyback
    • Order-Value Hybrid
  • Evaluation Method & Results (6 slides)
  • Conclusion & My Other Research (3 slides)
5
Motivation & Problem
6
Multithreaded Debugging
  • Sequential program: the crash reproduces under the debugger
    gcc hash.c
    a.out
    Segmentation fault
    gdb a.out
    gdb> run
    Program received SIGSEGV. In get() at hash.c:45
    45 a = bucket->d
  • Multithreaded program: the crash may not reproduce under the debugger
    gcc para-hash.c
    a.out
    Segmentation fault
    gdb a.out
    gdb> run
    Program exited normally.
  • With race recording: the logged run replays the crash
    gcc para-hash.c
    a.out
    Segmentation fault
    Race recorded in log
    gdb a.out log
    gdb> run
    Program received SIGSEGV. In get() at para-hash.c:67
    67 a = bucket->d
7
Race Recording
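[Figure: Thread I executes X = 1, increments X, then print(X), while Thread J executes X = X*5. In the original run and in an unrecorded replay, Thread J's update lands at different points in Thread I's sequence, so the two runs print different values (X = 6 versus X = 10).]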
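The divergence on this slide can be reproduced with a small model. This is a sketch only: it assumes Thread I runs X = 1, X++, print(X), Thread J runs X = X*5, and X starts at 0. The function `run` and its `j_position` parameter are illustrative names, not part of the thesis.

```python
def run(j_position):
    """Run thread I's three steps with thread J's single step inserted
    before step number j_position (0..3) of thread I; return what I prints."""
    shared = {"X": 0}  # assumed initial value of X
    printed = []
    thread_i = [
        lambda: shared.update(X=1),                # X = 1
        lambda: shared.update(X=shared["X"] + 1),  # X++
        lambda: printed.append(shared["X"]),       # print(X)
    ]
    thread_j = lambda: shared.update(X=shared["X"] * 5)  # X = X * 5
    schedule = thread_i[:j_position] + [thread_j] + thread_i[j_position:]
    for step in schedule:
        step()
    return printed[0]

# One printed value per possible position of thread J's update.
results = {pos: run(pos) for pos in range(4)}
print(results)
```

Positions 1 and 2 give the two values shown on the slide (6 and 10). A replay has to reproduce the recorded position of Thread J's update to print the same value.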
8
Recording for Multithreaded Replay
  • Race Recording
    • Not an issue for a single thread
    • Recreate the same general & data races
  • Checkpointing
    • Provide a snapshot of the program state
    • Many proposals (e.g., SafetyNet); not the focus here
  • Input Recording
    • Provide repeatable inputs
    • Some proposals (e.g., part of FDR); not the focus here

9
A Good Race Recorder
  • Long recording: small log
  • Low runtime overhead
  • Applicability
  • Low cost
gcc para-hash.c
a.out
Segmentation fault
Race recorded in log
gdb a.out log
gdb> run
Program received SIGSEGV. In get() at para-hash.c:67
67 a = bucket->d
10
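Desired & Existing Race Recorders
Recorder | Recording Length | Applicability: MP | Applicability: Racey Code | Applicability: SC & TSO | Overhead | Cost
Desired Recorder | Small Log Size | MP | Racey Code | SC & TSO | Negligible Slowdown | Little Hardware
Existing recorders rated against the same columns (per-cell ratings not preserved in this transcript): Instant Replay 87, RC 90, Bacon 91, Netzer 93, Déjà Vu 98, RecPlay 00, JaRec 04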

11
Small Log Size
RTR Algorithm
Coherence Piggyback
Order-Value Hybrid
Set/LRU Approximation
12
Problem Formulation
[Figure: the same two-thread instruction streams shown during recording and during replay. Thread I: ld A, st B, st C, st A, sub, ld B. Thread J: add, st C, ld B, ld D, st C, st D. Intra-thread dependences are drawn in black and inter-thread conflicts in red; the recorder writes the conflicts to a log.]
  • Reproduce exactly the same conflicts: no more, no less
13
Log All Conflicts
[Figure, replay view: every inter-thread conflict is enforced from the log.]
Thread I | Thread J
ld A | add
st B | st C
st C | ld B
st A | ld D
sub | st C
ld B | st D
  • Detect conflicts → write log
  • Assign IC (logical timestamps)
  • But too many conflicts
14
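Netzer's Transitive Reduction
[Figure, replay view: the two-thread example (Thread I, Thread J) with instruction counts 1 to 6 on each thread; conflicts already implied by other logged conflicts plus program order are marked "TR reduced" and are not logged.]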
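Transitive reduction can be sketched with vector clocks. This is a simplified illustration in the spirit of the slide, not Netzer's exact algorithm. The function name, the tuple layout, and the three-thread example are assumptions made for this sketch.

```python
def transitive_reduction(conflicts, threads):
    """Drop conflicts already implied by logged ones plus program order.

    conflicts: (src_thread, src_ic, dst_thread, dst_ic) tuples, listed so that
    each destination thread's dst_ic values never decrease.
    """
    zero = {t: 0 for t in threads}
    # snapshots[t]: (ic, clock) pairs; clock[u] is the latest instruction of
    # thread u known to be ordered before instruction ic of thread t.
    snapshots = {t: [(0, dict(zero))] for t in threads}

    def clock_at(thread, ic):
        clock = snapshots[thread][0][1]
        for snap_ic, snap in snapshots[thread]:
            if snap_ic <= ic:
                clock = snap
        clock = dict(clock)
        clock[thread] = ic  # a thread always knows its own position
        return clock

    log = []
    for src, src_ic, dst, dst_ic in conflicts:
        dst_clock = clock_at(dst, dst_ic)
        if dst_clock[src] >= src_ic:
            continue  # implied by earlier logged conflicts: reduced
        log.append((src, src_ic, dst, dst_ic))
        src_clock = clock_at(src, src_ic)
        merged = {t: max(dst_clock[t], src_clock[t]) for t in threads}
        snapshots[dst].append((dst_ic, merged))
    return log

# Hypothetical example: I:1 -> K:2 is implied by I:2 -> J:3 and J:4 -> K:1.
conflicts = [("I", 2, "J", 3), ("J", 4, "K", 1), ("I", 1, "K", 2), ("I", 5, "K", 3)]
reduced = transitive_reduction(conflicts, ["I", "J", "K"])
```

Each logged conflict carries a full vector of instruction counts. That per-conflict vector is the cost a hardware recorder would want to avoid.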
15
The Intuition of the New RTR Algorithm
After Reduction
16
Stricter Dependences to Aid Vectorization
[Figure, replay view: the two-thread example with instruction counts 1 to 6; the remaining dependences are replaced by stricter ones that line up in a regular pattern and can therefore be vectorized.]
17
Compress Vectorized Dependencies
[Figure, replay view: the two-thread example with instruction counts 1 to 6; the vectorized dependences are stored in compressed form.]
  • Reduces log size to KB/core/second
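The idea of stricter, vectorizable dependences can be illustrated as follows. This is a sketch under assumptions: a dependence is a pair (source IC in Thread I, destination IC in Thread J), and a run is encoded as (src_start, dst_start, stride, count). That encoding is hypothetical; the slides do not give the actual log format.

```python
def implies(strict, original):
    """A stricter dependence (later source, earlier destination) implies the
    original one through program order."""
    return strict[0] >= original[0] and strict[1] <= original[1]

def compress(deps):
    """Fold consecutive dependences with a constant stride into runs."""
    runs = []
    for s, d in deps:
        if runs:
            s0, d0, stride, count = runs[-1]
            if count == 1 and s - s0 == d - d0 and s > s0:
                runs[-1] = (s0, d0, s - s0, 2)  # start a strided run
                continue
            if count > 1 and (s, d) == (s0 + stride * count, d0 + stride * count):
                runs[-1] = (s0, d0, stride, count + 1)  # extend the run
                continue
        runs.append((s, d, 0, 1))
    return runs

def expand(runs):
    """Recover the individual dependences from the runs."""
    return [(s0 + k * stride, d0 + k * stride)
            for s0, d0, stride, count in runs for k in range(count)]

# Hypothetical original dependences and a stricter, regular replacement.
original = [(2, 5), (4, 6), (5, 9), (8, 11)]
strict = [(2, 3), (5, 6), (8, 9)]
covered = all(any(implies(st, o) for st in strict) for o in original)
runs = compress(strict)
```

Here four irregular dependences become one run of three. The stricter dependences are only usable if the recorded execution also satisfied them, so a real recorder has to check that before substituting them.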
18
Low Runtime Overhead
RTR Algorithm
Coherence Piggyback
Order-Value Hybrid
Set/LRU Approximation
19
Detect Conflicts
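  • Track A.readers and A.writer for each address A
  • Example: Thread I executes ld A at IC 1 → A.readers.add(I, 1)
[Figure, recording view: the two-thread example with instruction counts, annotated with the reader/writer bookkeeping.]
  • Expensive in software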
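The reader/writer bookkeeping can be sketched in software. The class below is a minimal illustration, not the thesis implementation. The lockstep interleaving (Thread I before Thread J at each instruction count) is an assumption, and `ConflictDetector` is an illustrative name.

```python
from collections import defaultdict

class ConflictDetector:
    """Per-address last writer and readers, stamped with instruction counts."""

    def __init__(self):
        self.writer = {}                  # addr -> (thread, ic) of the last store
        self.readers = defaultdict(dict)  # addr -> {thread: ic of its latest load}
        self.log = []                     # (kind, addr, source, destination)

    def load(self, thread, ic, addr):
        last = self.writer.get(addr)
        if last is not None and last[0] != thread:
            self.log.append(("RAW", addr, last, (thread, ic)))
        self.readers[addr][thread] = ic

    def store(self, thread, ic, addr):
        last = self.writer.get(addr)
        if last is not None and last[0] != thread:
            self.log.append(("WAW", addr, last, (thread, ic)))
        for reader, reader_ic in self.readers[addr].items():
            if reader != thread:
                self.log.append(("WAR", addr, (reader, reader_ic), (thread, ic)))
        self.readers[addr].clear()
        self.writer[addr] = (thread, ic)

# The two-thread example from the slides; None marks a non-memory instruction.
THREAD_I = [("ld", "A"), ("st", "B"), ("st", "C"), ("st", "A"), None, ("ld", "B")]
THREAD_J = [None, ("st", "C"), ("ld", "B"), ("ld", "D"), ("st", "C"), ("st", "D")]

detector = ConflictDetector()
for ic, (op_i, op_j) in enumerate(zip(THREAD_I, THREAD_J), start=1):
    for thread, op in (("I", op_i), ("J", op_j)):  # assumed lockstep order
        if op is None:
            continue
        kind, addr = op
        (detector.load if kind == "ld" else detector.store)(thread, ic, addr)
```

Every load and store pays for a table lookup and update, which is why the slide calls this approach expensive in software.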
20
Use Cache and Cache Coherence
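[Figure: Proc I and Proc J each have a cache whose lines hold Tag, State, Data, and Timestamp. Proc I: A in state S with timestamp 1, B in state M with timestamp 4. Proc J: A in state S with timestamp 3, B in state I with timestamp 2. Proc J executes ld B, and the coherence transaction exposes the conflict with Proc I's copy.]
  • RAW detected → logged
  • Detect conflicts in hardware with little runtime cost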
21
Cache Evictions and Writebacks
[Figure: the same two caches (Proc I: A S 1, B M 4; Proc J: A S 3, B I 2) during a st A; updated entries shown as "M 4" and "C M 3". Directory entry of A: Shared(I,J), Owner().]
  • WAR detected → logged
  • OK with nonsilent eviction & directory eviction
22
Implement TR and RTR in Hardware
  • Ideal TR requires vector timestamps
    • Too expensive
  • New idea: Pairwise-TR (uses scalar timestamps)
    • Enables pairwise transitive reduction
  • Optimal RTR algorithm is likely expensive
  • Implement a greedy RTR algorithm
    • One-pass, online algorithm
    • Keeps a sliding window of vectorizable dependencies
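One plausible reading of pairwise reduction is that each destination thread remembers, per source thread, the largest source IC it has already been ordered after. A conflict is skipped when that single number covers it. The sketch below follows that reading. It is an assumption, not the thesis design, and `pairwise_tr` is an illustrative name.

```python
from collections import defaultdict

def pairwise_tr(conflicts):
    """Log a conflict unless a previously logged conflict between the same
    pair of threads already orders the source instruction.

    conflicts: (src_thread, src_ic, dst_thread, dst_ic) tuples, listed so that
    each destination thread's dst_ic values never decrease.
    """
    last_logged = defaultdict(int)  # (dst, src) -> largest source IC logged
    log = []
    for src, src_ic, dst, dst_ic in conflicts:
        if src_ic <= last_logged[(dst, src)]:
            continue  # implied by an earlier src -> dst entry and program order
        last_logged[(dst, src)] = src_ic
        log.append((src, src_ic, dst, dst_ic))
    return log

# Hypothetical conflicts among three threads.
conflicts = [("I", 2, "J", 3), ("I", 1, "J", 4), ("J", 4, "K", 1), ("I", 1, "K", 2)]
logged = pairwise_tr(conflicts)
```

The second conflict is reduced because I:2 → J:3 already covers it. The last one is still logged even though it follows transitively through Thread J. That lost reduction is the price of using scalar state instead of vector timestamps.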

23
Hardware Implementation
Cache
  • Eviction/writeback: solved (more details later)
  • Directory protocols: solved
  • Snooping protocols: partly solved
  • Two-level coherence: not yet solved
Processor
  • Out-of-order/prefetching: solved
  • Unordered messages: solved
  • Counter overflow: solved
  • Thread migration: not yet solved
24
RTR Algorithm
Coherence Piggyback
Low Cost Hardware
Order-Value Hybrid
Set/LRU Approximation
25
Timestamp Approximation
[Figure, recording view: the two-thread example next to one set of Thread I's cache (Tag, State, Data, Timestamp; A: S, timestamp 1; B: M, timestamp 2) and the directory entry of A, Shared(I).]
  • When a line's timestamp is no longer held, use the current IC of thread I
  • Correct, but more evictions → more logged conflicts
26
Hardware Cost
Log Size
27
Set/LRU Approximation
  • Set/LRU better preserves reducibility
  • Small timestamp memory → more misses, but still a small log
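A set-associative timestamp memory with LRU replacement might look like the sketch below. It assumes timestamps only grow, and that a line missing from its set may safely report the largest timestamp ever evicted from that set (an overestimate, so no dependence is missed). The class and method names are illustrative. The actual Set/LRU policy in the thesis may differ.

```python
from collections import OrderedDict

class TimestampMemory:
    """Small set-associative store of per-line timestamps with LRU eviction."""

    def __init__(self, num_sets, ways):
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.evicted_max = [0] * num_sets  # per-set fallback timestamp
        self.ways = ways

    def record(self, addr, timestamp):
        index = addr % len(self.sets)
        entries = self.sets[index]
        if addr in entries:
            entries.move_to_end(addr)  # refresh LRU position
        elif len(entries) >= self.ways:
            _, old = entries.popitem(last=False)  # evict least recently used
            self.evicted_max[index] = max(self.evicted_max[index], old)
        entries[addr] = timestamp

    def lookup(self, addr):
        """Exact timestamp if present, otherwise a conservative estimate."""
        index = addr % len(self.sets)
        entries = self.sets[index]
        if addr in entries:
            return entries[addr]
        return self.evicted_max[index]

tsm = TimestampMemory(num_sets=2, ways=2)
for ts, addr in enumerate([0, 2, 4], start=1):  # all three map to set 0
    tsm.record(addr, ts)
```

Because LRU evicts the line touched longest ago, the fallback value tends to be old. An old timestamp is more likely to be covered by a dependence that is already logged.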
28
Hardware Cost of Timestamps
Coupled Timestamp Memory
[Figure: cache lines with Tag, State, Data, and Timestamp fields (A: S, timestamp 1; B: M, timestamp 2).]
  • Coupled timestamp memory overhead grows with cache size
    • Not flexible
  • 64B line, 64b (24b) timestamp → 12.5% (4.7%)
    overhead
    • 192 KB for a 4MB L2
  • Need to modify cache

29
Decoupled Timestamp Memory
[Figure: the coupled design, with a Timestamp field in every cache line (A: S, timestamp 1; B: M, timestamp 2), contrasted with a separate timestamp memory.]
  • Decoupling → small timestamp memory (Set/LRU)
    • e.g., 32-set, 64-way → 99% transitive reduction
    • Timestamp memory → 24 KB
  • No need to modify cache
  • From 192 KB to 24 KB: 8x reduction
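The figures on the coupled and decoupled slides can be checked with a few lines of arithmetic. The 12 bytes per entry is derived here from 24 KB over 32 × 64 entries. The slides do not say what an entry contains, so treat that number as an inference.

```python
LINE_BYTES = 64
L2_BYTES = 4 * 1024 * 1024

def coupled_overhead(timestamp_bits):
    """Timestamp bits as a fraction of the data bits in one cache line."""
    return timestamp_bits / (LINE_BYTES * 8)

l2_lines = L2_BYTES // LINE_BYTES               # lines in a 4MB L2
coupled_kb = l2_lines * 24 // 8 // 1024         # one 24-bit timestamp per line

tsm_entries = 32 * 64                           # 32 sets x 64 ways
decoupled_kb = 24                               # size quoted on the slide
bytes_per_entry = decoupled_kb * 1024 // tsm_entries  # derived, not from the slides

print(f"{coupled_overhead(64):.1%} / {coupled_overhead(24):.1%} overhead")
print(f"{coupled_kb} KB coupled, {decoupled_kb} KB decoupled, "
      f"{coupled_kb // decoupled_kb}x reduction, {bytes_per_entry} B per entry")
```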
30
RTR Algorithm
Coherence Piggyback
Order-Value Hybrid
Set/LRU Approximation
SC & TSO Applicability
31
Recording with Total Store Order (TSO)
  • The majority of existing multiprocessors (MP) are non-SC
  • TSO is well defined and x86-like

32
TSO Execution
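[Figure: Thread I executes st A,1 then ld B; Thread J executes st B,1 then ld A; initially A = B = 0. Each store waits in its thread's write buffer (WrBuf) while the following load reads from the memory system, where A = 0 and B = 0, so both loads return 0. The stores reach memory afterwards (A = 1, B = 1).]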
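The outcome on this slide can be reproduced with a toy model of per-thread write buffers. This is a sketch for illustration, not a full TSO model. The operation names (`st`, `ld`, `drain`) and the function names are assumptions.

```python
from itertools import permutations

THREAD_I = [("I", "st", "A"), ("I", "ld", "B")]
THREAD_J = [("J", "st", "B"), ("J", "ld", "A")]

def run_tso(schedule):
    """Execute (thread, op, addr) steps; stores sit in a FIFO write buffer
    until an explicit 'drain' step moves the oldest one to memory."""
    memory = {"A": 0, "B": 0}
    buffers = {"I": [], "J": []}
    loaded = {}
    for thread, op, addr in schedule:
        if op == "st":
            buffers[thread].append((addr, 1))
        elif op == "ld":
            value = memory[addr]
            for buffered_addr, buffered_value in buffers[thread]:
                if buffered_addr == addr:
                    value = buffered_value  # forward from own write buffer
            loaded[(thread, addr)] = value
        elif op == "drain":
            buffered_addr, buffered_value = buffers[thread].pop(0)
            memory[buffered_addr] = buffered_value
    return loaded

def sc_outcomes():
    """All (ld B by I, ld A by J) results when every store drains at once."""
    outcomes = set()
    for order in set(permutations("IIJJ")):
        pending = {"I": list(THREAD_I), "J": list(THREAD_J)}
        schedule = []
        for name in order:
            thread, op, addr = pending[name].pop(0)
            schedule.append((thread, op, addr))
            if op == "st":
                schedule.append((thread, "drain", None))
        loaded = run_tso(schedule)
        outcomes.add((loaded[("I", "B")], loaded[("J", "A")]))
    return outcomes

tso_schedule = [THREAD_I[0], THREAD_J[0], THREAD_I[1], THREAD_J[1],
                ("I", "drain", None), ("J", "drain", None)]
tso_result = run_tso(tso_schedule)
sc_results = sc_outcomes()
```

Under the buffered schedule both loads return 0. No sequentially consistent interleaving produces that outcome, so a recorder that logs only an SC-style order cannot replay it.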
33
Order-Value-Hybrid Recording
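[Figure, recording view: the same TSO example (Thread I: st A,1 then ld B; Thread J: st B,1 then ld A; initially A = B = 0), with both stores held in write buffers (WrBuf). Labels: Start Monitor A, Start Monitor B, Stop Monitor B, "A Changed!", Value Logged, WAR Omitted. A loaded address is monitored after the load; if it changes while monitored, the loaded value is logged and the corresponding WAR dependence is omitted.]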
34
Hybrid Recording with TR and RTR
  • Hybrid recording
    • All loads get correct values
    • Hardware similar to OoO SC (Gharachorloo et al.
      91)
  • Hybrid with TR & RTR
    • TR will not use the omitted WAR in reduction
    • RTR vectorizes dependencies more conservatively

35
Evaluation Method & Results
36
Put-it-together Determinizer/CMP
37
Simulation Method
  • Commercial server hardware
    • GEMS: http://www.cs.wisc.edu/gems
    • Full-system (OS and application) executions
    • 4-core CMP (sequentially consistent)
    • 1-way in-order issue, 2 GHz
    • 64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory
  • Commercial server software
    • Apache: static web serving
    • SpecJBB: middleware
    • OLTP: TPC-C like
    • Zeus: static web serving

38
Log Size: 1 byte/kilo-instr
  • Well within the capability of current machines
  • Long recording (days to months) needs improvement
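To see why days of recording are harder than seconds, the log rate can be turned into a volume. The sustained instruction rate below (100 million instructions per second per core) is a hypothetical figure chosen for illustration. It is not a measurement from the thesis.

```python
LOG_BYTES_PER_KILO_INSTR = 1          # from the slide
ASSUMED_INSTR_PER_SECOND = 100_000_000  # hypothetical sustained rate per core
SECONDS_PER_DAY = 24 * 60 * 60

bytes_per_second = ASSUMED_INSTR_PER_SECOND // 1000 * LOG_BYTES_PER_KILO_INSTR
bytes_per_day = bytes_per_second * SECONDS_PER_DAY

print(f"{bytes_per_second / 1e3:.0f} KB/s per core")
print(f"{bytes_per_day / 1e9:.2f} GB/day per core")
```

At this assumed rate the log grows by about 100 KB per second per core, which is easy to absorb. A full day adds several gigabytes per core, and a month adds roughly thirty times that.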

39
Runtime Overhead
Interconnection Msg. B/W
Our recorder can be always-on
40
Benefits of RTR and Set/LRU (Log Size)
[Figure: two log-size charts. "Improvement by RTR" compares Pairwise-TR with our RTR; "Effectiveness of Set/LRU" shows log size under the Set/LRU approximation.]
41
Why RTR and Set/LRU Work Well?
  • RTR
    • Processors execute instructions at similar speeds
    • Therefore, we can find vectorizable
      dependencies
  • Set/LRU
    • Temporal locality makes the LRU timestamps old
    • We only need to know if a timestamp is
      old enough

42
Sensitivity and Scalability
  • A design space of the timestamp memory (TSM)
    • Size: smaller TSM → larger log
    • Read/write timestamps should be used when TSM is
      large
    • Partial timestamp: 24 bits are enough
    • Associativity: higher is better for RTR
  • Scalability of the recorder
    • Studied with modest processor counts (2p to 16p)
    • Commercial workloads, not scientific workloads
    • Log size increases slowly with number of cores

43
Conclusion & My Other Research
44
Race Recording
  • Race recording → key to combating nondeterminism
  • My thesis → an effective & inexpensive recorder
    • RTR algorithm → small log size
    • Coherence piggyback → negligible slowdown
    • Timestamp approximation → low hardware cost
    • Order-value hybrid → supports SC & TSO
  • Future work
    • Improve race recording algorithm
    • Improve race recorder implementation
    • Study race replay

45
Serializability Violation Detector (PLDI05)
  • Like a race detector
  • No a priori annotation requirement
    • Critical sections are inferred
  • Intended to detect bugs that actually happen
  • Checks for a 2-phase-locking condition
[Figure: example access sequence: Read in1, Read local, Write out1, Write local, Read in2, Write out2.]
46
Publications
  • FDR (ISCA03)
    • Adopted by UCSD BugNet (ISCA05)
  • SVD (PLDI05)
    • Cited by Vaziri et al. (POPL06)
    • Influenced new data race definition
  • RTR, Set/LRU & Hybrid
    • Submitted for publication

47
Thank you!
gcc para-hash.c
a.out
Segmentation fault
Race recorded in log
gdb a.out log
gdb> run
Program received SIGSEGV. In get() at para-hash.c:67
67 a = bucket->d
48
Acknowledgements
  • Joint work with my advisors
    • Mark Hill, Ras Bodik
  • Ph.D. Committee
    • David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau,
      Barton Miller
  • Multifacet Group
    • Milo Martin, Dan Sorin, Carl Mauer, Brad
      Beckmann, Kevin Moore, Alaa Alameldeen, Mike
      Marty, Luke Yen
  • Affiliates & Companies
    • Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric
      Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft,
      Sun

49
Deterministic Replay is Useful
  • Deterministic replay is logically recreating a
    program execution
  • Present applications
    • Cyclic Debugging (Pancake & Netzer 93)
    • Fault Tolerance (ExtraVirt, Lucchetti et al.
      05)
    • Intrusion Analysis (ReVirt, Dunlap et al. 02)
  • Future applications
    • Data Recovery
    • Replay-based Synchronization

50
Multicore and Multithreading
  • Multicore is common
    • AMD X2
    • IBM Power 5/6, Cell
    • Intel Pentium D, Core Duo
    • Sun SPARC T1
  • Multithreading is common
    • Server: high throughput
    • Scientific: high performance
    • Desktop/embedded: low response time

51
Race Recording: Key to Determinism
  • Races: general race & data race (Netzer & Miller)
    • Both cause nondeterminism
  • Race recording can help, but
  • Existing race recorders are inadequate
    • Some generate large logs
    • Some have high runtime overhead
    • Some have high hardware cost (space overhead)
    • Support only sequential consistency
  • Need a better race recorder
52
Recording/Replay Debugging
[Figure: Online Recorder → Crash → Dump Core → Deterministic Replayer → Crash]
53
Deterministic Replay: Fault Tolerance
  • Fault Recovery
    • Replay after a failure
  • Fault Detection
    • Replay, then compare

54
Future: Record/Replay Undo/Redo
  • VM as a software platform
  • Ease software development
  • Fine granularity in Undo and Redo

55
Future: Replay-based Synchronization
  • Three steps
    • Coarse-grain sync. → fine-grain sync. → hardware
      sync.
  • Result: higher performance
  • Works only with static control flow & fixed data
    addresses
    • DSP kernels

56
Race Recording Related Work
Class | Recorder | Mechanism | Log size | Runtime overhead | Replay parallelism
Total-order | Bacon 91 (Hardware) | Bus transactions | Large | Low | Low
Total-order | RecPlay 00, JaRec 04 | Lamport clocks | Small | Low (sync only) | Low
Total-order | RC 90, Déjà Vu 98 | Scheduling | Small | Low (non-MP) | Low
Partial-order | Bacon 91 (Hardware) | Bus transaction groups | Large | Low | High
Partial-order | Instant Replay 87 | Variable version | Large | High | High
Partial-order | Netzer 93 | Vector clocks | Small | High | High
57
Correctness of Order-Value-Hybrid
  • Removing WAR dependencies
    • Say thread I reads, thread J writes
    • Removing the WAR affects I's read, not J's write
    • But for every dependence removed, thread I reads the
      correct value from the value log
  • Therefore, all reads get the correct value

58
TR and TSO
  • TR affects dependencies reduced by a WAR
    • The WAR itself may later be removed during replay
  • Solution: do not use a WAR in TR if the WAR can be
    removed
    • Respond with a special flag when a loaded cache
      line is stolen
[Figure, recording view: Thread I executes st A, st C, ld B and Thread J executes st B, st C, ld A, with instruction counts 1 to 3; one dependence is marked "Must not be reduced".]
59
RTR and TSO
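  • The sliding window may expose the ordered loads
  • Shrink the sliding window to avoid it
[Figure, recording view: Thread I and Thread J with instruction counts 1 to 4 (instructions shown: st A, add, add, sub, st B, ld A, ld C, ld B). One store is still in the write buffer and the loads around it are marked "ordered". The old window for J:3 is replaced by a smaller new window, and one dependence is marked "Not allowed by new window".]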
60
Deadlock Avoidance of RTR
[Figure, recording view: the two-thread example (Thread I, Thread J) with instruction counts 1 to 6.]
  • Avoid deadlock by adhering to an SC total order
61
Recording Race-free Executions
  • No data races
    • Only need to record synchronization races
  • Deterministic replay up until the first data race

62
Replay Parallelism
  • Replay performance depends on
  • Number of synchronizations
  • Extra wait incurred by the synchronizations

63
Directory Protocols
  • Add sticky states in the directory
    • Retain states after writebacks
    • Need extra acknowledgements
  • Or, add extra timestamp memory in the directory
    • Helps to avoid extra acknowledgements
  • A tradeoff
    • Sticky states can be cheaper
    • But extra timestamp memory can be faster

64
Snooping Protocols
  • Key problem is the combined/implicit response
  • Not a problem for AMD Hammer
[Figure: Proc I and Proc J caches (Tag, State, Data, Timestamp; Proc I: A S 1, B M 4; Proc J: A S 3, B I 2) during a st A, with the label "Current IC".]
  • WAR detected → logged
65
Nonsilent Evictions
[Figure: Proc I and Proc J caches (Tag, State, Data, Timestamp; Proc I: A S 1, B M 4; Proc J: A S 3, B I 2) during a st A; updated entries shown as "M 4" and "C M 3". Directory entry of A: Shared(J), Owner(), StickyS(I,J).]
  • Directory eviction: more false conflicts, like
    snooping
66
Out-of-Order & Hardware Prefetching
  • Speculative execution
    • No IC assigned yet
  • Hardware prefetching
    • No IC assigned
  • Key idea: receive observation
    • Can associate a ld/st with the currently committing
      instruction

67
Unordered Messages in Interconnect
  • Messages arrive out of order
    • Can affect reduction
  • Better: add a sequence number
    • Reconstruct the message order
    • Enable IC compression by sending deltas

68
Integer Overflow
  • IC and timestamps may overflow
  • IC: make it 64-bit; it will not overflow for a long
    time
  • Timestamps: use approximation techniques
    • MSB of IC + LSB of timestamps
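One way to read "MSB of IC + LSB of timestamps" is that only the low 24 bits of each timestamp are stored, and the high bits are borrowed from the current 64-bit IC when the timestamp is used. The sketch below follows that reading. It assumes a timestamp is never newer than the current IC. The function names are illustrative.

```python
WIDTH = 24                  # partial timestamp width in bits
MASK = (1 << WIDTH) - 1

def truncate(timestamp):
    """Keep only the low WIDTH bits of a full timestamp."""
    return timestamp & MASK

def reconstruct(partial, current_ic):
    """Rebuild a full timestamp from its low bits and the current IC."""
    candidate = (current_ic & ~MASK) | partial
    if candidate > current_ic:
        # The low bits wrapped since the timestamp was taken.
        candidate -= 1 << WIDTH
    return candidate

current_ic = (5 << WIDTH) + 5000
recent = (5 << WIDTH) + 100      # same high bits as the current IC
wrapped = (5 << WIDTH) - 10      # taken just before the low bits wrapped
very_old = 100                   # far more than 2**24 instructions ago

rebuilt = [reconstruct(truncate(ts), current_ic) for ts in (recent, wrapped, very_old)]
```

Timestamps within 2**24 instructions of the current IC come back exactly. An older one comes back as a newer value that is still no later than the current IC, so the error is on the conservative side.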

69
Varying TSM Size
70
Varying Associativity
71
Varying Partial Timestamp Width
72
Log Size Scaling
73
In Retrospect
  • What are you most proud of?
    • RTR improves TR after 13 years
  • What would you do differently if doing it again?
    • Replaying me is deterministic (just kidding)
    • I wish I had focused on race recording earlier
  • What should the industry do?
    • Implement the recorder as a VMM extension