Transcript and Presenter's Notes

Title: What is a Data Race?


1
What is a Data Race?
  • Two concurrent accesses to a shared location, at
    least one of them for writing.
  • Indicative of a bug

Thread 1            Thread 2
X++                 T = Y
Z = 2               T = X
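As a concrete illustration (a minimal sketch of my own, not from the slides), the following C++ program contains exactly this kind of race: one thread writes the shared variable x while another reads it, with no synchronization between them.

  #include <iostream>
  #include <thread>

  int x = 0;                  // shared location accessed by both threads

  void writer() {
      for (int i = 0; i < 100000; ++i)
          x++;                // write access, no synchronization
  }

  void reader() {
      int t = 0;
      for (int i = 0; i < 100000; ++i)
          t = x;              // concurrent read access: this pair is a data race
      std::cout << "last value seen: " << t << "\n";
  }

  int main() {
      std::thread t1(writer), t2(reader);
      t1.join();
      t2.join();
  }

Because the two accesses are unsynchronized and one of them is a write, the program's behavior is undefined under the C++ memory model.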
2
How Can Data Races be Prevented?
  • Explicit synchronization between threads
  • Locks
  • Critical Sections
  • Barriers
  • Mutexes
  • Semaphores
  • Monitors
  • Events
  • Etc.

Thread 1            Thread 2
Lock(m)             Lock(m)
X++                 T = X
Unlock(m)           Unlock(m)
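The earlier racy example becomes race-free when both accesses are guarded by one mutex; this is a hedged C++ rendering of the Lock(m)/Unlock(m) pattern sketched above (function and variable names are mine).

  #include <iostream>
  #include <mutex>
  #include <thread>

  int x = 0;
  std::mutex m;               // plays the role of the lock m on the slide

  void writer() {
      for (int i = 0; i < 100000; ++i) {
          std::lock_guard<std::mutex> guard(m);   // Lock(m) ... Unlock(m)
          x++;
      }
  }

  void reader() {
      int t = 0;
      for (int i = 0; i < 100000; ++i) {
          std::lock_guard<std::mutex> guard(m);   // same lock orders the accesses
          t = x;
      }
      std::cout << "last value seen: " << t << "\n";
  }

  int main() {
      std::thread t1(writer), t2(reader);
      t1.join();
      t2.join();
  }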
3
Is This Sufficient?
  • Yes!
  • No!
  • Programmer dependent
  • Correctness: the programmer may forget to
    synchronize
  • Need tools to detect data races
  • Expensive
  • Efficiency: to achieve correctness, the programmer
    may over-synchronize
  • Need tools to remove excessive synchronization

4
Where is Waldo?
  #define N 100
  Type* g_stack = new Type[N];
  int g_counter = 0;
  Lock g_lock;

  void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
  void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

  void popAll( ) {
      lock(g_lock);
      delete[] g_stack;
      g_stack = new Type[N];
      g_counter = 0;
      unlock(g_lock);
  }

  int find( Type& obj, int number ) {
      lock(g_lock);
      int i;
      for (i = 0; i < number; i++)
          if (obj == g_stack[i]) break;  // Found!!!
      if (i == number) i = -1;           // Not found: return -1 to caller

5
Can You Find the Race?
  #define N 100
  Type* g_stack = new Type[N];
  int g_counter = 0;
  Lock g_lock;

  void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
  void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

  void popAll( ) {
      lock(g_lock);
      delete[] g_stack;
      g_stack = new Type[N];
      g_counter = 0;
      unlock(g_lock);
  }

  int find( Type& obj, int number ) {
      lock(g_lock);
      int i;
      for (i = 0; i < number; i++)
          if (obj == g_stack[i]) break;  // Found!!!
      if (i == number) i = -1;           // Not found: return -1 to caller

A similar problem was found in java.util.Vector.
(Arrows on the original slide mark the racing write and read.)
6
Detecting Data Races?
  • NP-hard [Netzer & Miller 1990]
  • Input size = # instructions performed
  • Even for 3 threads only
  • Even with no loops/recursion
  • # execution orders/schedulings: (#threads)^(thread_length)
  • # inputs
  • Detection code's side effects
  • Weak memory, instruction reordering, atomicity
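For intuition (my arithmetic, not from the slides): with t threads each executing n instructions, the number of possible interleavings is

  \[
    \frac{(tn)!}{(n!)^{t}} \;\le\; t^{\,tn},
  \]

which is the (#threads)^(thread_length) blow-up the slide refers to; even t = 2 threads of n = 10 instructions already allow 184,756 interleavings.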

7
Motivation
  • Run-time framework goals
  • Collect a complete trace of a program's user-mode
    execution
  • Keep the tracing overhead, for both space and time,
    low
  • Re-simulate the traced execution deterministically,
    based on the collected trace, with full fidelity
    down to the instruction level
  • Full fidelity: user mode only; no tracing of the
    kernel, only user-mode I/O callbacks
  • Advantages
  • Complete program trace that can be analyzed from
    multiple perspectives (replay analyzers:
    debuggers, locality, etc.)
  • Trace can be collected on one machine and
    replayed on other machines (or analyzed live
    by streaming)
  • Challenges: trace size and performance

8
Original Record-Replay Approaches
  • InstantReplay '87
  • Records the order of memory accesses
  • Overhead may affect program behavior
  • RecPlay '00
  • Records only synchronizations
  • Not deterministic in the presence of data races
  • Netzer '93
  • Records an optimal trace
  • Too expensive to keep track of all memory
    locations
  • Bacon & Goldstein '91
  • Records memory bus transactions with hardware
  • High logging bandwidth

9
Motivation
  • Increasing use and development of multi-core
    processors
  • MT program behavior is non-deterministic
  • To effectively debug software, developers must be
    able to replay executions that exhibit
    concurrency bugs
  • Shared memory updates can happen in a different
    order from run to run

10
Related Concepts
  • Runtime interpretation/translation of binary
    instructions
  • Requires no static instrumentation or special
    symbol information
  • Handles dynamically generated code and
    self-modifying code
  • Recording/logging slowdown: 100-200x
  • More recent logging approaches
  • Proposed hardware support (for the MT domain)
  • FDR (Flight Data Recorder)
  • BugNet (cache bits set on first load)
  • RTR (Regulated Transitive Reduction)
  • DeLorean (ISCA 2008: chunks of instructions)
  • Strata (time layer across all the logs of the
    running threads)
  • iDNA (diagnostic infrastructure using Nirvana,
    Microsoft)

11
Deterministic Replay
  • Re-execute the exact same sequence of
    instructions as recorded in a previous run
  • Single-threaded programs
  • Record the load values needed to reproduce the
    behavior of a run (Load Log)
  • Registers updated by system calls and signal
    handlers (Reg Log)
  • Output of special instructions such as RDTSC and
    CPUID (Reg Log)
  • System calls (virtualization: cloning of arguments,
    updates)
  • Checkpointing (log summary 10 million)
  • Multi-threaded programs
  • Log the interleaving among threads (shared memory
    update ordering: SMO Log)
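For the single-threaded case, the logging idea can be pictured with a small C++ sketch (my own, with assumed names, not the PinSEL/iDNA implementation): selected load values are recorded together with the dynamic instruction count, and during replay the logged value is substituted whenever the counts match.

  #include <cstdint>
  #include <unordered_map>

  // Hypothetical load log: the dynamic instruction count of a load
  // plus the value that must be reproduced during replay.
  class LoadLog {
      std::unordered_map<uint64_t, uint64_t> entries_;   // icount -> value
  public:
      // Recording side: remember the value this load returned.
      void record(uint64_t icount, uint64_t value) { entries_[icount] = value; }

      // Replay side: if this instruction count was logged, substitute the
      // logged value; otherwise the load is simply re-executed.
      bool replay(uint64_t icount, uint64_t& value) const {
          auto it = entries_.find(icount);
          if (it == entries_.end()) return false;
          value = it->second;
          return true;
      }
  };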

12
PinSEL: System Effect Log (SEL)
  • Logs the program load values needed for
    deterministic replay:
  • The first access to a memory location
  • Values modified by the system (system effect) and
    read by the program
  • Machine- and time-sensitive instructions
    (cpuid, rdtsc)

  Example program execution:
    Store A  (A <- 111)
    Store B  (B <- 55)
    Load C   (C = 9)                   logged (first access)
    Load D   (D = 10)                  logged (first access)
    system call modifies (B -> 0) and (C -> 99)
    Load A   (A = 111)                 not logged (reproduced by the program's own store)
    Load B   (B = 0)                   logged (modified by the system)
    Load C   (C = 99)                  logged (modified by the system)
    Load D   (D = 10)                  not logged
  • Trace size is 4-5 bytes per instruction
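The decision of what to log can be sketched as follows (a simplification of mine, not the actual PinSEL code): the logger keeps a software copy of memory that it updates on every store and every logged load, and a load is written to the log only when the copy has no value for that address (first access) or disagrees with what the program actually read (e.g. because a system call changed the location).

  #include <cstdint>
  #include <unordered_map>

  // Hypothetical shadow copy of user memory.
  std::unordered_map<uint64_t, uint64_t> shadow;

  // Called for every store the program executes.
  void onStore(uint64_t addr, uint64_t value) {
      shadow[addr] = value;                    // the copy now predicts later loads
  }

  // Called for every load; returns true if the value must be logged.
  bool onLoad(uint64_t addr, uint64_t actual) {
      auto it = shadow.find(addr);
      bool mustLog = (it == shadow.end())      // first access to this location
                  || (it->second != actual);   // changed behind the program's back
      shadow[addr] = actual;                   // keep the copy consistent either way
      return mustLog;
  }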

13
Optimization: Trace Select Reads
  • Observation: hardware caches eliminate most
    off-chip reads
  • Optimize logging accordingly
  • Logger and replayer simulate identical cache
    memories
  • A simple cache (the memory-copy structure) decides
    which values to log; there are no tags or valid
    bits to check. If the values mismatch, they are
    logged.
  • Average trace size is <1 bit per instruction
  • The only read not predicted (and therefore logged)
    is the one that follows the system call
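A hedged sketch of that optimization (assumed details, not the actual Pin code): both logger and replayer index a fixed-size, tagless array of values by low address bits; because the replayer maintains the identical structure, it knows which loads it must take from the log without any extra metadata.

  #include <cstdint>
  #include <vector>

  // Tagless, fixed-size value cache; collisions merely cause extra logging.
  class ValuePredictor {
      std::vector<uint64_t> slots_;
  public:
      explicit ValuePredictor(unsigned sizePow2)
          : slots_(size_t(1) << sizePow2, 0) {}

      size_t index(uint64_t addr) const {
          return static_cast<size_t>(addr & (slots_.size() - 1));
      }

      // Stores always update the copy (they are never logged as loads).
      void onStore(uint64_t addr, uint64_t value) { slots_[index(addr)] = value; }

      // Logger side: a load goes to the trace only when the copy disagrees
      // with the value the program actually read.
      bool onLoad(uint64_t addr, uint64_t actual) {
          uint64_t& slot = slots_[index(addr)];
          bool mustLog = (slot != actual);
          slot = actual;               // the replayer updates its copy identically
          return mustLog;
      }
  };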

14
Example: Overhead
  • PinSEL and PinPLAY
  • Initial work (2006) with single-threaded
    programs
  • SPEC2000 ref runs: 130x slowdown for PinSEL and
    80x for PinPLAY (w/o inlining)
  • A subset of the SPLASH2 benchmarks: 230x
    slowdown for PinSEL
  • Now, geo-mean over SPEC2006:
  • Pin: 1.4x
  • Logger: 83.6x
  • Replayer: 1.4x

15
Example: Microsoft iDNA Trace Writer Performance

  Application   Simulated Instr. (M)   Trace Size   Bits/Instr.   Native Time   Time While Tracing   Overhead (x)
  Gzip          24,097                 245 MB       0.09          11.7 s        187 s                15.98
  Excel         1,781                  99 MB        0.47          18.2 s        105 s                5.76
  PowerPoint    7,392                  528 MB       0.60          43.6 s        247 s                5.66
  IE            116                    5 MB         0.50          0.499 s       6.94 s               13.90
  Vulcan        2,408                  152 MB       0.53          2.74 s        46.6 s               17.01
  Satsolver     9,431                  1,300 MB     1.16          9.78 s        127 s                12.98
  • Memcheck/Valgrind are in the 30-40x range on
    CPU2006
  • iDNA: 11x (does not log shared-memory
    dependences explicitly)
  • A sequence number is kept for every lock-prefixed
    memory operation, enabling offline data race
    analysis

16
Logging Shared Memory Ordering (Cristiano's
PinSEL/PinPLAY Overview)
  • Emulation of directory-based cache coherence
  • Identifies RAW, WAR, and WAW dependences
  • Indexed by hashing the effective address
  • Each entry represents an address range

(Figure: memory operations from the program execution, such as Store A and Load B, are hashed by effective address to directory entries.)
17
Directory Entries
  • Every DirEntry maintains:
  • The thread id of the last_writer
  • A timestamp, i.e. the # of memory references the
    thread has executed
  • A vector of timestamps of the last access by each
    thread to that entry
  • On loads: update the timestamp for the thread in
    the entry
  • On stores: update the timestamp and the
    last_writer fields

(Figure: threads T1 and T2 each execute three memory references that touch locations A and F; the accesses hash to DirEntry[A-D] and DirEntry[E-H], each of which records the last-writer thread id and a per-thread vector of last-access timestamps that are updated as the program executes.)
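The bookkeeping described above can be sketched in C++ (names and types are my assumptions, not the tool's actual code):

  #include <cstdint>
  #include <vector>

  constexpr int kNumThreads = 2;       // assumed: number of traced threads

  // One directory entry, selected by hashing the effective address;
  // it covers a range of addresses.
  struct DirEntry {
      int last_writer = -1;                           // thread id of the last store
      std::vector<uint64_t> last_access =             // per-thread timestamp of the
          std::vector<uint64_t>(kNumThreads, 0);      // last access to this entry
  };

  // refCount = # of memory references the thread has executed so far.
  void onLoad(DirEntry& e, int tid, uint64_t refCount) {
      e.last_access[tid] = refCount;   // loads update only the timestamp
  }

  void onStore(DirEntry& e, int tid, uint64_t refCount) {
      e.last_access[tid] = refCount;   // stores update the timestamp
      e.last_writer = tid;             // ...and the last_writer field
  }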
18
Detecting Dependences
  • A RAW dependence between threads T and T' is
    established if:
  • T executes a load that maps to directory entry A
  • T' is the last_writer for the same entry
  • A WAW dependence between T and T' is established
    if:
  • T executes a store that maps to directory entry A
  • T' is the last_writer for the same entry
  • A WAR dependence between T and T' is established
    if:
  • T executes a store that maps to directory entry A
  • T' has accessed the same entry in the past and T'
    is not the last_writer
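Continuing the sketch (again with assumed names), the three rules translate directly into checks against a directory entry; every detected dependence becomes an SMO-log constraint of the form "the current thread's reference must wait for the other thread's reference". After detection, the entry is updated as in the previous sketch.

  #include <cstdint>
  #include <vector>

  struct DirEntry {                        // same shape as in the sketch above
      int last_writer = -1;
      std::vector<uint64_t> last_access;   // one timestamp per thread, 0 = never
  };

  struct Dependence {                      // "dst must wait for src"
      int src_tid; uint64_t src_ref;
      int dst_tid; uint64_t dst_ref;
  };

  // Thread `tid`, executing its memory reference number `refCount`,
  // touches entry `e`.
  std::vector<Dependence> detect(const DirEntry& e, int tid,
                                 uint64_t refCount, bool isStore) {
      std::vector<Dependence> deps;
      // RAW (load) or WAW (store): the last writer was another thread.
      if (e.last_writer != -1 && e.last_writer != tid)
          deps.push_back({e.last_writer, e.last_access[e.last_writer],
                          tid, refCount});
      // WAR: a store follows earlier accesses by other, non-writer threads.
      if (isStore)
          for (int t = 0; t < (int)e.last_access.size(); ++t)
              if (t != tid && t != e.last_writer && e.last_access[t] != 0)
                  deps.push_back({t, e.last_access[t], tid, refCount});
      return deps;
  }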

19
Example
(Figure: the same two-thread execution as on the previous slide, now annotated with the WAW, RAW, and WAR dependences the directory detects between T1 and T2, which become entries in the SMO logs.)

SMO logs:
  T2:2 after T1:1   (Thread T2 cannot execute its memory reference 2
                     until T1 has executed its memory reference 1)
  T1:2 after T2:2   (Thread T1 cannot execute its memory reference 2
                     until T2 has executed its memory reference 2)
  T1:3 after T2:3
20
Ordering Memory Accesses (Reducing log size)
  • Preserving the order will reproduce the execution
  • a→b : a happens-before b
  • Ordering is transitive: a→b and b→c imply a→c
  • Two instructions must be ordered if
  • they both access the same memory, and
  • one of them is a write

21
Constraints Enforcing Order
  • To guarantee a→d, any one of the following
    constraints suffices:
  • a→d
  • b→d
  • a→c
  • b→c
  • Suppose we need b→c anyway:
  • b→c is necessary
  • a→d is redundant

(Figure: P1 executes a then b, P2 executes c then d; the constraint that enforces more order than a→d strictly requires is marked "overconstrained".)
22
Problem Formulation
(Figure: the instruction streams of Thread I and Thread J (loads and stores to A, B, C, and D interleaved with ALU operations), shown once for recording and once for replay; intra-thread dependences are drawn in black, cross-thread conflicts in red, and the conflicts detected during recording are written to the log.)
  • Reproduce exact same conflicts no more, no less

23
Log All Conflicts
(Figure: the same two-thread instruction streams, with every cross-thread conflict logged during recording and enforced at replay.)

  • Assign IC (logical timestamps)
  • But too many conflicts
  • → Detect conflicts → Write log

24
Netzer's Transitive Reduction

(Figure: the same two-thread execution with instructions numbered 1-6 in each thread; after transitive reduction (TR), only the conflicts that are not already implied by program order and previously kept edges remain to be enforced at replay.)
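A hedged sketch of the reduction (a simplified pairwise version, not Netzer's full algorithm): because each thread consumes its memory references in program order, a new cross-thread edge can be dropped whenever an already-logged edge between the same pair of threads had a source reference at least as late.

  #include <cstdint>
  #include <vector>

  constexpr int kThreads = 2;          // assumed: two threads, as in the figure

  // logged[dst][src] = largest source reference of thread `src` for which an
  // edge into thread `dst` has already been written to the log.
  uint64_t logged[kThreads][kThreads] = {};

  struct Edge { int src_tid; uint64_t src_ref; int dst_tid; uint64_t dst_ref; };
  std::vector<Edge> smoLog;

  // A conflict requires thread `dst`, at its reference `dst_ref`, to wait
  // until thread `src` has executed its reference `src_ref`.
  void addConstraint(int src, uint64_t src_ref, int dst, uint64_t dst_ref) {
      if (logged[dst][src] >= src_ref)
          return;                      // implied by an earlier, later-sourced edge
      smoLog.push_back({src, src_ref, dst, dst_ref});
      logged[dst][src] = src_ref;      // remember the strongest edge seen so far
  }

Netzer's algorithm also removes edges that are implied transitively through chains of other logged edges; the pairwise check above only covers the simplest case.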
25
RTR (Regulated Transitive Reduction): Stricter
Dependences to Aid Vectorization

(Figure: the same two-thread execution, with instructions numbered 1-6 in each thread, showing the stricter set of dependences RTR keeps for replay.)

  • 4% overhead for RTR+FDR (simulated on GEMS);
    0.2 MB/core/second of logging (Apache)