Effective and Inexpensive (Memory) Race Recording - PowerPoint PPT Presentation

About This Presentation

Title:

Effective and Inexpensive (Memory) Race Recording

Description:

Title: A Serializability Violation Detector for Shared-Memory Server Programs Author: Min Xu Last modified by: owen Created Date: 4/22/2005 3:03:52 PM – PowerPoint PPT presentation


Slides: 74

Provided by: MinX150

Learn more at: https://research.cs.wisc.edu


Transcript and Presenter's Notes

Title: Effective and Inexpensive (Memory) Race Recording

1
Effective and Inexpensive (Memory) Race Recording

Min Xu
Thesis Defense
05/04/2006
Electrical and Computer Engineering Department,
UW-Madison
Advisors: Mark Hill, Rastislav Bodik
Committee: Remzi Arpaci-Dusseau, Mikko Lipasti, Barton Miller, David Wood

2
Overview

Increasingly useful to replay multithreaded code
Race recording: key to dealing with nondeterminism
A Case Study
Long recording: 1 byte/kilo-instr
Always-on recording: less than 2% overhead
Low cost: 24 KB RAM/core
Supports both SC & TSO (x86-like)

3
Thesis Contributions
An effective and inexpensive race recorder:
RTR Algorithm: small log size
Coherence Piggyback: low runtime overhead
Set/LRU Approximation: low-cost hardware
Order-Value Hybrid: SC & TSO applicability
4
Outline
Motivation & Problem (5 slides)
An Effective and Inexpensive Race Recorder (21 slides)
RTR Algorithm
Set/LRU Approximation
Coherence Piggyback
Order-Value Hybrid
Evaluation Method & Results (6 slides)
Conclusion & My Other Research (3 slides)
5
Motivation & Problem
6
Multithreaded Debugging

gcc hash.c
a.out
Segmentation fault

gdb a.out
gdb> run
Program received SIGSEGV.
In get() at hash.c:45
45 a = bucket->d

gcc para-hash.c
a.out
Segmentation fault

gdb a.out
gdb> run
Program exited normally.
gdb>

gcc para-hash.c
a.out
Segmentation fault
Race recorded in log

gdb a.out log
gdb> run
Program received SIGSEGV.
In get() at para-hash.c:67
67 a = bucket->d
7
Race Recording
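Figure: threads I and J race on a shared variable X, shown once for the original run and once for the replay.
Thread I: X = 1; X++; print(X)
Thread J: X = X * 5
The two runs interleave thread J's update differently, so one ends with X = 6 and the other with X = 10.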
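The two outcomes on this slide can be reproduced by hand. The sketch below is a minimal illustration, assuming thread I runs X = 1; X++; print(X) and thread J runs X = X * 5 (the operators are inferred from the two results, 6 and 10). The function `run` and the schedule encoding are hypothetical names used only for this example.

```python
def run(schedule):
    """Execute the racy example under one interleaving; return what thread I prints."""
    state = {"X": 0}
    printed = []
    thread_i = [
        lambda: state.update(X=1),                # X = 1
        lambda: state.update(X=state["X"] + 1),   # X++
        lambda: printed.append(state["X"]),       # print(X)
    ]
    thread_j = [
        lambda: state.update(X=state["X"] * 5),   # X = X * 5
    ]
    code = {"I": thread_i, "J": thread_j}
    pc = {"I": 0, "J": 0}
    for thread in schedule:                       # schedule names the thread to step next
        code[thread][pc[thread]]()
        pc[thread] += 1
    return printed[0]

j_between = run(["I", "J", "I", "I"])     # J runs between X = 1 and X++
j_after_inc = run(["I", "I", "J", "I"])   # J runs after X++
print(j_between, j_after_inc)
```

The same program gives two different results depending only on where thread J's update lands, which is the nondeterminism a race recorder has to capture.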
8
Recording for Multithreaded Replay

Race Recording
Not an issue for a single thread
Create the same general & data races
Checkpointing
Provide a snapshot of the program state
Many proposals (e.g., SafetyNet); not the focus here
Input Recording
Provide repeatable inputs
Some proposals (e.g., part of FDR); not the focus here

9
A Good Race Recorder
Low runtime overhead
Applicability
Low cost
Long recording: small log
10
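Desired & Existing Race Recorders
Recorder | Recording Length | Applicability | Applicability | Applicability | Overhead | Cost
Desired Recorder | Small Log Size | MP | Racey Code | SC & TSO | Negligible Slowdown | Little Hardware
Existing recorders compared (their table cells are not preserved in this transcript): Instant Replay 87, RC 90, Bacon 91, Netzer 93, Déjà Vu 98, RecPlay 00, JaRec 04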

11
Small Log Size
RTR Algorithm
Coherence Piggyback
Order-Value Hybrid
Set/LRU Approximation
12
Problem Formulation
Figure: instruction streams of threads I and J (ld A, add, st B, st C, st C, ld B, ld D, st A, sub, st C, ld B, st D across the two threads), shown once for recording and once for replay, with the log carried between them. Dependences are drawn in black, conflicts in red.
Reproduce exactly the same conflicts: no more, no less
13
Log All Conflicts
Figure: the same instruction streams of threads I and J, replayed from a log of every conflict.

Detect conflicts -> Write log

Assign IC (logical timestamps)
But too many conflicts
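A conflict log is enough to steer a replay: each entry says that one instruction may not run until a given instruction of another thread has finished. The sketch below shows that idea under stated assumptions: log entries are `(src_thread, src_ic, dst_thread, dst_ic)` tuples, ICs count from 1, and `replay_order` is a hypothetical helper, not the thesis's replayer.

```python
def replay_order(program_len, log):
    """Return one legal replay schedule as a list of (thread, ic) pairs.

    program_len: {thread: number of instructions}
    log: list of (src_thread, src_ic, dst_thread, dst_ic) conflicts to enforce
    """
    waits = {}
    for src, sic, dst, dic in log:
        waits.setdefault((dst, dic), []).append((src, sic))

    done = {t: 0 for t in program_len}   # highest IC each thread has completed
    schedule = []
    progress = True
    while progress:
        progress = False
        for t in sorted(program_len):
            nxt = done[t] + 1
            if nxt > program_len[t]:
                continue                  # thread t has finished
            # Run the next instruction only if every logged source has completed.
            if all(done[s] >= sic for s, sic in waits.get((t, nxt), [])):
                done[t] = nxt
                schedule.append((t, nxt))
                progress = True
    if any(done[t] < program_len[t] for t in program_len):
        raise RuntimeError("logged dependences form a cycle")
    return schedule

example_log = [("I", 2, "J", 3), ("J", 5, "I", 4)]   # hypothetical conflicts
schedule = replay_order({"I": 6, "J": 6}, example_log)
```

Every entry costs log space and a wait during replay, which is why logging all conflicts is too expensive and a reduction step is needed.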
14
Netzer's Transitive Reduction
Figure: threads I and J with instruction counts 1 to 6. Dependences that are implied by other logged dependences are marked "TR reduced" and are left out of the log used for replay.
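Transitive reduction drops a dependence when it is already implied by dependences logged earlier. The sketch below is one way to express that with vector clocks; it assumes dependences arrive in recording order as `(src_thread, src_ic, dst_thread, dst_ic)` tuples, and the function name and data layout are illustrative choices rather than Netzer's or the thesis's implementation.

```python
def transitive_reduction(deps, threads):
    """Return only the dependences that are not implied by earlier logged ones."""
    # vc[t][u]: latest IC of thread u known to be ordered before t's current point.
    vc = {t: {u: 0 for u in threads} for t in threads}
    # history[t]: (ic, clock) pairs; the clock is valid for t from that IC onward.
    history = {t: [(0, dict(vc[t]))] for t in threads}

    def clock_at(t, ic):
        for start, snapshot in reversed(history[t]):
            if start <= ic:
                return snapshot

    log = []
    for src, sic, dst, dic in deps:
        if vc[dst][src] >= sic:
            continue                      # implied by an earlier dependence
        log.append((src, sic, dst, dic))
        known = clock_at(src, sic)        # what src knew when it executed sic
        for u in threads:
            vc[dst][u] = max(vc[dst][u], known[u])
        vc[dst][src] = max(vc[dst][src], sic)
        history[dst].append((dic, dict(vc[dst])))
    return log

deps = [("I", 2, "J", 3), ("I", 1, "J", 4), ("J", 3, "K", 1), ("I", 2, "K", 2)]
reduced = transitive_reduction(deps, ["I", "J", "K"])
```

In this hypothetical input, I:1 -> J:4 is implied by I:2 -> J:3, and I:2 -> K:2 is implied through J, so only two of the four dependences are logged.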
15
The Intuition of the New RTR Algorithm
After Reduction
16
Stricter Dependences to Aid Vectorization
Figure: threads I and J with instruction counts 1 to 6, replayed with dependences that are stricter than the recorded ones so that they line up into a regular, vectorizable pattern.
17
Compress Vectorized Dependencies
Figure: threads I and J with instruction counts 1 to 6, replayed from the vectorized dependences stored in compressed form.
Reduce log size to KB/core/second
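Regular dependences compress well. The sketch below is not the RTR algorithm; it only illustrates the compression idea under an assumption, namely that a run of dependences between one pair of threads whose source and destination ICs advance by the same constant stride can be stored as a single `(src_ic, dst_ic, stride, count)` vector. The names `compress` and `expand` are hypothetical.

```python
def compress(deps):
    """deps: (src_ic, dst_ic) pairs between one fixed pair of threads, in log order.

    Returns a list of (src_ic, dst_ic, stride, count) vectors.
    """
    out = []
    for s, d in deps:
        if out:
            s0, d0, stride, count = out[-1]
            last_s = s0 + stride * (count - 1)
            last_d = d0 + stride * (count - 1)
            step = s - last_s
            # Extend the current vector if both ICs advance by the same stride.
            if step > 0 and d - last_d == step and (count == 1 or step == stride):
                out[-1] = (s0, d0, step, count + 1)
                continue
        out.append((s, d, 0, 1))          # start a new vector of length 1
    return out

def expand(vectors):
    """Inverse of compress: list every dependence a vector stands for."""
    return [(s0 + k * stride, d0 + k * stride)
            for s0, d0, stride, count in vectors
            for k in range(count)]

pairs = [(1, 2), (2, 3), (3, 4), (4, 5), (9, 20)]
vectors = compress(pairs)
```

Making dependences stricter than necessary pays off when it lets more of them fall into such a run, because one vector then replaces many individual log entries.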
18
Low Runtime Overhead
RTR Algorithm
Coherence Piggyback
Order-Value Hybrid
Set/LRU Approximation
19
Detect Conflicts
Per-address metadata: A.readers, A.writer
Example: thread I executes ld A at IC 1, so A.readers.add(I, 1)
Figure: threads I and J during recording, with instruction counts 1 to 4.
Expensive in software
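The readers/writer bookkeeping can be sketched in software as follows. This is a minimal illustration, assuming a trace of `(thread, ic, op, addr)` tuples with `op` being `"ld"` or `"st"`; the `Meta` class and `record` function are hypothetical, and a dict stands in for the slide's `A.readers.add(I, 1)`.

```python
class Meta:
    """Per-address metadata, mirroring A.readers and A.writer."""
    def __init__(self):
        self.readers = {}    # thread -> IC of its latest read since the last write
        self.writer = None   # (thread, IC) of the latest write

def record(trace):
    """Return the conflicts (kind, source, destination) seen in the trace."""
    meta = {}
    log = []
    for thread, ic, op, addr in trace:
        m = meta.setdefault(addr, Meta())
        if op == "ld":
            if m.writer and m.writer[0] != thread:
                log.append(("RAW", m.writer, (thread, ic)))
            m.readers[thread] = ic
        else:
            if m.writer and m.writer[0] != thread:
                log.append(("WAW", m.writer, (thread, ic)))
            for reader, ric in m.readers.items():
                if reader != thread:
                    log.append(("WAR", (reader, ric), (thread, ic)))
            m.writer = (thread, ic)
            m.readers = {}
    return log

trace = [("I", 1, "ld", "A"), ("I", 2, "st", "B"),
         ("J", 3, "ld", "B"), ("J", 4, "st", "A")]   # hypothetical interleaving
conflicts = record(trace)
```

Every load and store pays for a metadata lookup and update, which is the cost that makes this approach expensive in software.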
20
Use Cache and Cache Coherence
Proc I
Proc J
ld B
Proc I cache (Tag, State, Data, Timestamp): A S 1; B M 4
Proc J cache (Tag, State, Data, Timestamp): A S 3; B I 2
RAW detected & logged
Detect conflicts in hardware with little runtime cost
21
Cache Evictions and Writebacks
Proc I
Proc J
st A
Proc I cache (Tag, State, Data, Timestamp): A S 1; B M 4
Proc J cache (Tag, State, Data, Timestamp): A S 3; B I 2
M 4
C M 3
WAR detected & logged
Directory of A: Shared(I,J), Owner()
OK with nonsilent evictions and directory evictions
22
Implement TR and RTR in Hardware

Ideal TR requires vector timestamps
Too expensive
New idea: Pairwise-TR (uses scalar timestamps)
Enables pairwise transitive reduction
Optimal RTR algorithm is likely expensive
Implement a greedy RTR algorithm
One-pass, online algorithm
Keeps a sliding window of vectorizable dependencies
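With scalar timestamps, reduction can still be done per pair of threads. The sketch below is one reading of Pairwise-TR, assuming dependences arrive in recording order so that destination ICs never decrease for a given thread; the class name and fields are hypothetical.

```python
class PairwiseTR:
    """Drop a dependence src:sic -> dst:dic if the same pair already logged a later source."""
    def __init__(self):
        self.last = {}   # (src, dst) -> largest source IC already ordered before dst
        self.log = []

    def observe(self, src, sic, dst, dic):
        if self.last.get((src, dst), 0) >= sic:
            return False                 # implied by an earlier entry for this pair
        self.last[(src, dst)] = sic
        self.log.append((src, sic, dst, dic))
        return True

tr = PairwiseTR()
deps = [("I", 2, "J", 3), ("I", 1, "J", 4), ("J", 3, "K", 1), ("I", 2, "K", 2)]
kept = [tr.observe(*dep) for dep in deps]
```

The last dependence is kept here even though it is implied through thread J: a pairwise scheme only sees one pair at a time, which is the price of avoiding vector timestamps.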

23
Hardware Implementation
Cache
Eviction/writeback: solved, more details later
Directory protocols: solved
Snooping protocols: partly solved
Two-level coherence: not yet solved
Processor
Out-of-order/prefetching: solved
Unordered messages: solved
Counter overflow: solved
Thread migration: not yet solved
24
RTR Algorithm
Coherence Piggyback
Low Cost Hardware
Order-Value Hybrid
Set/LRU Approximation
25
Timestamp Approximation
Figure: threads I and J during recording, with one set of I's cache shown (Tag, State, Data, Timestamp): A S 1; B M 2.
When a line's timestamp is no longer available, use the current IC of thread I
Directory of A: Shared(I)
Correct, but more evictions -> more logged conflicts
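The approximation can be sketched as a small set-associative timestamp store with LRU replacement. This is an illustration only, assuming line addresses are integers and that a miss falls back to the thread's current IC; the class and method names are hypothetical.

```python
from collections import OrderedDict

class TimestampMemory:
    """Set-associative, LRU-replaced store of per-line timestamps."""
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def update(self, line_addr, ic):
        s = self.sets[line_addr % self.num_sets]
        s[line_addr] = ic
        s.move_to_end(line_addr)         # mark as most recently used
        if len(s) > self.ways:
            s.popitem(last=False)        # evict the least recently used timestamp

    def lookup(self, line_addr, current_ic):
        s = self.sets[line_addr % self.num_sets]
        # On a miss the exact timestamp is gone; the current IC is a safe upper bound.
        return s.get(line_addr, current_ic)

tsm = TimestampMemory(num_sets=1, ways=2)
tsm.update(0xA, 1)
tsm.update(0xB, 2)
tsm.update(0xC, 3)                       # evicts the timestamp of line 0xA
approx = tsm.lookup(0xA, current_ic=3)
exact = tsm.lookup(0xB, current_ic=3)
```

Returning a timestamp that is too new never breaks replay, since it only orders more instructions than strictly needed, but it can turn a reducible dependence into a logged one.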
26
Hardware Cost
Log Size
27
Set/LRU Approximation
Set/LRU better preserves reducibility
Small -> more misses -> but still a small log
28
Hardware Cost of Timestamps
Coupled Timestamp Memory
Cache line fields (Tag, State, Data, Timestamp): A S 1; B M 2

Coupled timestamp memory overhead grows with cache size
Not flexible
64B line, 64b (24b) timestamp -> 12.5% (4.7%) overhead
192 KB for a 4MB L2
Need to modify cache

29
Decoupled Timestamp Memory
Coupled Timestamp Memory
Cache line fields (Tag, State, Data, Timestamp): A S 1; B M 2

Decoupling -> small timestamp memory (Set/LRU)
e.g., 32-set, 64-way -> 99% transitive reduction
Timestamp memory -> 24 KB
No need to modify cache

From 192 KB to 24 KB: 8x reduction
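The hardware-cost figures follow from simple arithmetic, reproduced below as a check. The line size, timestamp widths, L2 size, and the 32-set, 64-way geometry come from the slides; the bytes-per-entry value is only what those numbers imply, not a stated design parameter.

```python
LINE_BYTES = 64
L2_BYTES = 4 * 1024 * 1024

def coupled_overhead(timestamp_bits):
    """Fraction of extra storage when every cache line carries its own timestamp."""
    return timestamp_bits / (LINE_BYTES * 8)

lines = L2_BYTES // LINE_BYTES                  # 65536 lines in a 4 MB L2
coupled_kb = lines * 24 // 8 // 1024            # 24-bit timestamp per line

entries = 32 * 64                               # 32 sets x 64 ways
decoupled_kb = 24
bytes_per_entry = decoupled_kb * 1024 // entries  # implied by the slide's numbers
reduction = coupled_kb // decoupled_kb

print(f"{coupled_overhead(64):.1%} / {coupled_overhead(24):.1%}")
print(coupled_kb, "KB coupled,", decoupled_kb, "KB decoupled,", reduction, "x smaller")
```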
30
RTR Algorithm
Coherence Piggyback
Order-Value Hybrid
Set/LRU Approximation
SC TSO Applicability
31
Recording with Total Store Order (TSO)

The majority of existing MPs are non-SC
TSO is well defined and x86-like

32
TSO Execution
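Figure: TSO execution of two threads, each with a write buffer (WrBuf) in front of the memory system.
Thread I: 1: st A,1; 2: ld B
Thread J: 1: st B,1; 2: ld A
Memory system starts with A = B = 0. Each store waits in its thread's write buffer while the following load reads memory, so both loads can return 0; memory holds A = 1, B = 1 once the buffers drain.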
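The effect of the write buffers can be reproduced with a toy model. The sketch below assumes thread I runs st A,1; ld B and thread J runs st B,1; ld A, with FIFO write buffers that forward to their own thread's loads; `run_tso` and the step names are hypothetical and model only what this example needs.

```python
def run_tso(schedule):
    """Run the two-thread example under a toy TSO model with FIFO write buffers."""
    memory = {"A": 0, "B": 0}
    wrbuf = {"I": [], "J": []}
    loaded = {}

    def store(thread, addr, value):
        wrbuf[thread].append((addr, value))      # stores wait in the write buffer

    def load(thread, addr):
        for a, v in reversed(wrbuf[thread]):     # forward from the thread's own buffer
            if a == addr:
                return v
        return memory[addr]

    def drain(thread):
        addr, value = wrbuf[thread].pop(0)       # oldest store reaches memory
        memory[addr] = value

    steps = {
        "I.st": lambda: store("I", "A", 1),
        "I.ld": lambda: loaded.update(I=load("I", "B")),
        "I.drain": lambda: drain("I"),
        "J.st": lambda: store("J", "B", 1),
        "J.ld": lambda: loaded.update(J=load("J", "A")),
        "J.drain": lambda: drain("J"),
    }
    for step in schedule:
        steps[step]()
    return loaded, memory

# Loads overtake the buffered stores: an outcome that SC forbids.
tso_loaded, tso_memory = run_tso(["I.st", "J.st", "I.ld", "J.ld", "I.drain", "J.drain"])
# Stores drain before the loads: an SC-like outcome.
sc_loaded, _ = run_tso(["I.st", "I.drain", "J.st", "J.drain", "I.ld", "J.ld"])
```

In the first schedule both loads return 0, a result that no ordering of the four instructions alone can explain, which is why an order-only log is not enough under TSO.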
33
Order-Value-Hybrid Recording
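Figure: the same TSO execution during recording (thread I: st A,1; ld B and thread J: st B,1; ld A, with write buffers and A = B = 0 initially).
Labels in the figure: Start Monitor A, Start Monitor B, Stop Monitor B, "A Changed!", WAR omitted, value logged.
A location loaded early is monitored; when A changes while monitored, the WAR dependence is omitted from the order log and the loaded value is logged instead.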
34
Hybrid Recording with TR and RTR

Hybrid recording
All loads get correct values
Hardware similar to OoO SC (Gharachorloo et al. 91)
Hybrid with TR and RTR
TR will not use the omitted WAR in reduction
RTR vectorizes dependencies more conservatively

35
Evaluation Method & Results
36
Put-it-together Determinizer/CMP
37
Simulation Method

Commercial server hardware
GEMS: http://www.cs.wisc.edu/gems
Full-system (OS & application) executions
4-core CMP (sequentially consistent)
1-way in-order issue, 2 GHz
64KB I/D L1, 4MB L2, 64-byte lines, MOSI directory

Commercial server software
Apache: static web serving
SpecJBB: middleware
OLTP: TPC-C-like
Zeus: static web serving

38
Log Size: 1 byte/kilo-instr

Well within the capability of current machines
Long recordings (days to months) need improvement

39
Runtime Overhead
Interconnection Msg. B/W
Our recorder can be always-on
40
Benefits of RTR and Set/LRU (Log Size)
Figure: two log-size plots, "Improvement by RTR" (Pairwise-TR vs. our RTR) and "Effectiveness of Set/LRU".
41
Why Do RTR and Set/LRU Work Well?

RTR
Processors execute instructions at similar speeds
Therefore, we can find vectorizable dependencies
Set/LRU
Temporal locality makes the LRU timestamps old
We only need to know if a timestamp is old enough

42
Sensitivity and Scalability

A design space of the timestamp memory (TSM)
Size: smaller TSM -> larger log
Read/write timestamps should be used when the TSM is large
Partial timestamp: 24 bits is enough
Associativity: higher is better for RTR
Scalability of the recorder
Studied with modest processor counts (2p to 16p)
Commercial workloads, not scientific workloads
Log size increases slowly with the number of cores

43
Conclusion & My Other Research
44
Race Recording

Race recording -> key to combating nondeterminism
My thesis -> an effective & inexpensive recorder
RTR algorithm -> small log size
Coherence piggyback -> negligible slowdown
Timestamp approximation -> low hardware cost
Order-value hybrid -> supports SC & TSO
Future work
Improve the race recording algorithm
Improve the race recorder implementation
Study race replay

45
Serializability Violation Detector (PLDI05)

Like a race detector
No a priori annotation requirement
Critical sections are inferred
Intended to detect bugs that actually happen
Checks for a 2-Phase-Locking condition

Read in1
Read local
Write out1
Write local
Read in2
Write out2
46
Publications

FDR (ISCA03)
Adopted by UCSD BugNet (ISCA05)
SVD (PLDI05)
Cited by Vaziri et al. (POPL06)
Influenced new data race definition
RTR, Set/LRU & Hybrid
Submitted for publication

47
Thank you!
48
Acknowledgements

Joint work with my advisors: Mark Hill, Ras Bodik
Ph.D. Committee: David Wood, Mikko Lipasti, Remzi Arpaci-Dusseau, Barton Miller
Multifacet Group: Milo Martin, Dan Sorin, Carl Mauer, Brad Beckmann, Kevin Moore, Alaa Alameldeen, Mike Marty, Luke Yen
Affiliates & Companies: Joe Emer, CJ Newburn, Peter Hsu, Bob Zak, Eric Bach, Gang Luo, Alex Chow, IBM, Intel, Microsoft, Sun

49
Deterministic Replay is Useful

Deterministic replay logically recreates a program execution
Present applications
Cyclic Debugging (Pancake & Netzer 93)
Fault Tolerance (ExtraVirt, Lucchetti et al. 05)
Intrusion Analysis (ReVirt, Dunlap et al. 02)
Future applications
Data Recovery
Replay-based Synchronization

50
Multicore and Multithreading

Multicore is common
AMD X2
IBM Power 5/6, Cell
Intel Pentium D, Core Duo
Sun SPARC T1
Multithreading is common
Server: high throughput
Scientific: high performance
Desktop/embedded: low response time

51
Race Recording: Key to Determinism

Races = general races + data races (Netzer & Miller)
Both cause nondeterminism
Race recording can help, but
Existing race recorders are inadequate
Some generate large logs
Some have high runtime overhead
Some have high hardware cost (space overhead)
Some support only sequential consistency

Need a better race recorder
52
Recording/Replay Debugging

Online Recorder

Crash
Dump Core
Deterministic Replayer
Crash
53
Deterministic Replay Fault Tolerance

Fault Recovery
Replay after a failure
Fault Detection
Replay then compare

54
Future: Record/Replay Undo/Redo

VM as a software platform
Ease software development
Fine granularity in Undo and Redo

55
Future: Replay-based Synchronization

Three steps
Coarse-grain sync. -> fine-grain sync. -> hardware sync.
Result: higher performance
Works only with static control flow & fixed data addresses
e.g., DSP kernels

56
Race Recording Related Work
Class | Total-order | Total-order | Total-order | Partial-order | Partial-order | Partial-order
Recorder | Bacon 91 (Hardware) | RecPlay 00, JaRec 04 | RC 90, Déjà Vu 98 | Bacon 91 (Hardware) | Instant Replay 87 | Netzer 93
Mechanism | Bus transactions | Lamport clocks | Scheduling | Bus transaction groups | Variable version | Vector clocks
Log size | Large log | Small log | Small log | Large log | Large log | Small log
Overhead | Low overhead | Low overhead (sync only) | Low overhead (non-MP) | Low overhead | High overhead | High overhead
Replay parallelism | Low | Low | Low | High | High | High
57
Correctness of Order-Value-Hybrid

Removing WAR dependencies
Say thread I reads and thread J writes
Removing the WAR affects I's read, not J's write
But for every dependence removed, thread I reads the correct value from the value log
Therefore, all reads get the correct value

58
TR and TSO

TR affects dependencies reduced by a WAR
The WAR itself may later be removed during replay
Solution: do not use a WAR in TR if the WAR can be removed
Respond with a special flag when a loaded cache line is stolen

Figure: threads I and J during recording (st A, st B, st C, st C, ld B, ld A across the two threads, instruction counts 1 to 3), with one dependence marked "Must not be reduced".
59
RTR and TSO

The sliding window may expose the ordered loads
Shrink the sliding window to avoid it

Figure: threads I and J during recording (st A, add, add, sub, st B, ld A, ld C, ld B across the two threads, instruction counts 1 to 4). Labels: "ordered" loads, a store still "in write buffer", "old win for j3", "new win for j3", and a dependence marked "Not allowed by new window".
60
Deadlock Avoidance of RTR
Figure: the instruction streams of threads I and J during recording, with instruction counts 1 to 6.
Avoid deadlock by adhering to an SC total order
61
Recording Race-free Executions

No data races
Only need to record synchronization races
Deterministic replay up until the first data race

62
Replay Parallelism

Replay performance depends on
Number of synchronizations
Extra wait incurred by the synchronizations

63
Directory Protocols

Add sticky states in the directory
Retain states after writebacks
Need extra acknowledgements
Or, add extra timestamp memory in the directory
Helps to avoid extra acknowledgements
A tradeoff
Sticky states can be cheaper
But extra timestamp memory can be faster

64
Snooping Protocols

Key problem is the combined/implicit response
Not a problem for AMD Hammer

Proc I
Proc J
st A
Proc I cache (Tag, State, Data, Timestamp): A S 1; B M 4
Proc J cache (Tag, State, Data, Timestamp): A S 3; B I 2
Current IC
WAR detected & logged
65
Nonsilent Evictions
Proc I
Proc J
st A
Proc I cache (Tag, State, Data, Timestamp): A S 1; B M 4
Proc J cache (Tag, State, Data, Timestamp): A S 3; B I 2
M 4
C M 3
Directory of A: Shared(J), Owner(), StickyS(I,J)
Directory eviction: more false conflicts, as with snooping
66
Out-of-Order & Hardware Prefetching