Symbolic Program Consistency Checking of OpenMP Parallel Programs with Relaxed Memory Models - PowerPoint PPT Presentation

About This Presentation
Title:

Symbolic Program Consistency Checking of OpenMP Parallel Programs with Relaxed Memory Models

Description:

Symbolic Program Consistency Checking of OpenMP Parallel Programs with Relaxed Memory Models Based on an LCTES 2012 paper. Shun-Ching Yang Guan-Cheng Chen – PowerPoint PPT presentation

Date added: 11 March 2020
Slides: 48
Transcript and Presenter's Notes

Title: Symbolic Program Consistency Checking of OpenMP Parallel Programs with Relaxed Memory Models


1
Symbolic Program Consistency Checking of OpenMP
Parallel Programs with Relaxed Memory Models
Based on an LCTES 2012 paper.
Shun-Ching Yang, Guan-Cheng Chen, Che-Chang Chan (National Taiwan University)
Farn Wang (National Taiwan University, Academia Sinica)
Fang Yu (National Cheng Chi University)

2
Outline
  • Introduction
  • Motivation
  • Parallel program correctness
  • Related work
  • 2-step program consistency checking
  • Step 1: Static race constraint solution
  • Step 2: Guided simulation
  • Extended finite-state machine (EFSM), relaxed
    memory models
  • Implementation
  • Experiments
  • Conclusion

3
Motivation (1/4)
  • Parallel Programming
  • Multi-cores,
  • General purpose computation on GPU (GPGPU)
  • Distributed computing, cloud computing
  • Challenges
  • Parallel loops, chunk sizes, threads, schedules
  • Arrays, pointer aliases,
  • Relaxed memory models

4
Motivation (2/4)
A running example in C with OpenMP
for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) private(i,j) schedule(static,1) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++)
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
  }
}
5
Motivation (3/4)
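for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++)
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
  }
}
Iterations of i assigned to each thread with chunk size c:
Thread 1: k+1, ..., k+1+c-1
Thread 2: k+1+c, ..., k+1+2c-1
Thread 3: k+1+2c, ..., k+1+3c-1
Thread 4: k+1+3c, ..., k+1+4c-1
Thread 1: k+1+4c, ..., k+1+5c-1
...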
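The chunk-to-thread mapping listed above can be written as a small helper. This is a minimal sketch, assuming chunks of c consecutive iterations are dealt out round-robin as on the slide; the function name `static_owner` and its parameters are illustrative and not part of OpenMP.

```c
#include <assert.h>

/* Hypothetical helper: which thread (1-based, as on the slide) executes
 * iteration i of a loop that starts at `lower`, under schedule(static, c)
 * with `nthreads` threads. */
int static_owner(int i, int lower, int c, int nthreads)
{
    assert(i >= lower && c > 0 && nthreads > 0);
    int chunk = (i - lower) / c;   /* index of the chunk containing i */
    return chunk % nthreads + 1;   /* chunks are assigned round-robin */
}
```

For the running example the loop starts at `lower = k+1`, so with k = 0, c = 2 and 4 threads, iterations 1 and 2 belong to thread 1 and iteration 9 wraps around to thread 1 again.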
6
Motivation (4/4)
  • Many programming frameworks
  • fork/join
  • Pthreads
  • Open Multi-Processing (OpenMP)
  • Threading Building Blocks (TBB)
  • Microsoft

7
Parallel Program Correctness (1/4)
  • Program level: what users care about
  • Determinism
  • For every input, all executions yield the same
    output.
  • Consistency
  • All executions yield the same output as the
    sequential execution.
  • Race-freedom
  • Parallel executions do not yield different
    results.
  • All three are seemingly equivalent at program level,
  • unless the sequential execution is not one of the
    parallel executions.

8
Parallel Program Correctness (2/4)
[Figure: a program as a sequence of parallel regions: parallel for, parallel while, parallel for, parallel for]
  • Checking the correctness property of each
    parallel region (PR)
  • Correctness at PRs
  • ⇒ correctness of the program

9
Parallel Program Correctness (3/4)
  • In practice
  • It may be unclear what the program result is.
  • Instead, properties for correctness at PR level
    are usually checked.
  • determinism
  • consistency
  • race-freedom
  • At RW schedule levels, values do not count.
  • linearizability (transaction levels)

10
Parallel Program Correctness (4/4)
  • Linearizability (transaction level)
  • ⇒ race-freedom (PR RW level)
  • ⇒ determinism (PR level),
    consistency (PR level)
  • ⇒ race-freedom (program level),
    determinism (program level),
    consistency (program level)
  • ⇒ program correctness

11
Related Work (1/4)
  • Thread analyzer of Sun Studio [Lin 2008]
  • Static race detection, no arrays
  • Intel Thread Checker [Petersen & Shah 2003]
  • Dynamic approach
  • Instrumentation approach on client-server for
    race detection [Kang et al. 2009]
  • Run-time monitoring in OpenMP programs
  • OmpVerify [Basupalli et al. 2011]
  • Polyhedral analysis for Affine Control Loops

12
Related Work in PLDI 2012 (2/4): no simulation as
the 2nd step
  • Detect races via liquid effects [Kawaguchi,
    Rondon, Bakst, Jhala]
  • Type inference for precise race detection.
  • No arrays.
  • Speculative Linearizability [Guerraoui, Kuncak, Losa]
  • Reasoning about Relaxed Programs [Carbin, Kim,
    Misailovic, Rinard]
  • Parallelizing Top-Down Interprocedural Analysis
    [Albarghouthi, Kumar, Nori, Rajamani]

13
Related Work in PLDI 2012 (3/4): no simulation as
the 2nd step
  • Sound and Precise Analysis of Parallel Programs
    through Schedule-Specialization [Wu, Tang, Hu, et
    al.]
  • Race Detection for Web Applications [Petrov,
    Vechev, Sridharan, Dolby]
  • Concurrent Data Representation Synthesis
    [Hawkins, Aiken, Fisher, et al.]
  • Dynamic Synthesis for Relaxed Memory Models [Liu,
    Nedev, Prisadnikov, et al.]

14
Related Work in PLDI 2012 (4/4): no simulation as
the 2nd step
  • Tools
  • Parcae [Raman, Zaks, Lee, et al.]
  • Chimera [Lee, Chen, Flinn, Narayanasamy]
  • Janus [Tripp, Manevich, Field, Sagiv]
  • Reagents [Turon]

15
Methodology (1/2)
  • Assumptions
  • Arrays do not overlap.
  • No pointers other than arrays.
  • Fixed threads, chunk size, scheduling policy.
  • We analyze consistency of program implementation.
  • Focusing on OpenMP.
  • The techniques should be applicable to other
    frameworks.
  • Output result prescribed by users.

16
Why OpenMP ?
  • Complicated enough
  • Practical enough
  • Parallelizes programs automatically
  • Is an industry-standard application
    programming interface (API)
  • Is supported by Sun Studio, Intel Parallel
    Studio, Visual C++, GNU Compiler Collection
    (GCC).

17
Methodology (2/2)
  • 2-step program consistency checking.

[Flowchart: program consistency checking]
Potential race analysis at PR level → potential race report → guided simulation for program consistency violations → end
18
Step 1: Potential Races at PR level
  • Necessary constraints as Presburger formulas
  • A race constraint between each pair of memory
    references to the same location by different
    threads.
  • Solution of the pairwise constraints via
    Presburger formula solving.

19
Step 1: Potential Race Analysis
[Flowchart] C program with OpenMP → Pairwise Constraints Generator → Pairwise Race Constraints → Constraint Solver → Sat?
Sat? No: race-freedom
Sat? Yes: potential races (truth assignment)
20
Potential Race Constraint
  • A Potential Race Constraint
  • Thread Path Condition ∧ Race Condition
  • Thread Path Condition
  • Necessary for a thread to access a memory
    location in a statement
  • Obtained by symbolic postcondition analysis
  • Race Condition
  • The necessary condition of an access by two
    threads in a parallel region

21
Running example
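for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++)
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
  }
}
Iterations of i assigned to each thread with chunk size c:
Thread 1: k+1, ..., k+1+c-1
Thread 2: k+1+c, ..., k+1+2c-1
Thread 3: k+1+2c, ..., k+1+3c-1
Thread 4: k+1+3c, ..., k+1+4c-1
Thread 1: k+1+4c, ..., k+1+5c-1
...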
22
Thread Path Condition of L[i][k]
for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++)
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
  }
}
Thread 1: (i_t1 - (k+1)) % 4 = 0 ∧ k+1 ≤ i_t1 < size
23
Thread Path Conditions of L[i-1][k]
for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++)
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
  }
}
Thread 2: (i_t2 - (k+1) - 1) % 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size
24
Race Condition of L[i][k] and L[i-1][k]
for (k = 0; k < size-1; k++) {
  #pragma omp parallel for default(none) shared(M,L,size,k) private(i,j) schedule(static,c) num_threads(4)
  for (i = k+1; i < size; i++) {
    L[i][k] = M[i][k] / M[k][k];
    for (j = k+1; j < size; j++)
      M[i][j] = M[i][j] - L[i-1][k] * M[k][j];
  }
}

(i_t1 - (k+1)) % 4 = 0 ∧ k+1 ≤ i_t1 < size
∧ (i_t2 - (k+1) - 1) % 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size
∧ k = k ∧ i_t1 = i_t2 - 1
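For small, fixed values of k and size, the race constraint above can be checked by plain enumeration. The sketch below is only an illustration: it swaps Presburger solving for brute force, and the function name `count_race_witnesses` is hypothetical. It counts the assignments (i_t1, i_t2, j_t2) that satisfy both thread path conditions and the race condition i_t1 = i_t2 - 1.

```c
#include <assert.h>

/* Brute-force enumeration of the race constraint for fixed k and size.
 * Not a Presburger solver: it simply tries every candidate value. */
int count_race_witnesses(int k, int size)
{
    assert(k >= 0 && size >= 0);
    int count = 0;
    for (int i1 = k + 1; i1 < size; i1++) {
        if ((i1 - (k + 1)) % 4 != 0) continue;          /* thread 1 path condition */
        for (int i2 = k + 1; i2 < size; i2++) {
            if ((i2 - (k + 1) - 1) % 4 != 0) continue;  /* thread 2 path condition */
            if (i1 != i2 - 1) continue;                 /* L[i1][k] and L[i2-1][k] coincide */
            for (int j2 = k + 1; j2 < size; j2++)
                count++;                                /* each j2 in range is a witness */
        }
    }
    return count;
}
```

A nonzero count means the pair of accesses is a potential race for that k and size; a symbolic solver answers the same question for all values at once.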
25
Potential Race Constraint Solving
All Presburger
Potential races (Omega lib.):
  i_1 = k+1+4alpha
  i_2 = k+2+4alpha
  i_2 = i_1+1
  i_1 < size
  i_2 < size
  k+1 <= i_1
  k+1 <= i_2
  k+1 <= j_2
  j_2 < size
  i_1: 0, [0, ∞), not_tight
  i_2: 0, [0, ∞), not_tight
Constraint solved:
(i_t1 - (k+1)) % 4 = 0 ∧ k+1 ≤ i_t1 < size
∧ (i_t2 - (k+1) - 1) % 4 = 0 ∧ k+1 ≤ i_t2 < size ∧ k+1 ≤ j_t2 < size
∧ k = k ∧ i_t1 = i_t2 - 1
26
Step 2 Guided symbolic simulation
  • Program models
  • Extended finite-state machine (EFSM)
  • Relaxed memory model
  • Simulator of EFSM
  • Stepwise, backtrack, fixed point
  • Witness of program consistency violations
  • comparison with the sequential execution result.

27
Guided Simulation
[Flowchart] C program with OpenMP → Model Generator → Model (EFSM)
Model (EFSM) + potential races (from step 1) → Simulation
Fixed point? No: continue simulation. Yes: check consistency.
Consistency? No: consistency violations. Yes: consistency (w. benign races).
28
C Program Model Construction (1/2)
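Example: #pragma omp for schedule(static, c) num_threads(m)
for (x = i; x < j; x++) S
y is an auxiliary local variable for the position within a chunk. t is
the serial number of the thread.
EFSM transitions, written as (guard) actions:
  start: (true) x = (t-1)*c + i; y = 0;
  (x < j ∧ y < c-1) x++; y++;
  (x-y+m*c ≤ j ∧ y = c-1) x = x-y+m*c; y = 0;
  S (loop body)
  to stop: (x > j)
  to stop: (x-y+m*c > j ∧ y = c-1)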
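The chunked-loop model above can be exercised directly. The following is a rough sketch, assuming the loop bound is x < j and folding the two exit guards into a single loop test; the name `efsm_iterations` and its output-array interface are illustrative only.

```c
#include <assert.h>

/* Walk the chunked-loop model for thread t (1-based) of m threads.
 * x is the loop variable, y the position inside the current chunk.
 * Executed iterations are stored in out[] (up to cap entries);
 * the return value is the number of iterations the thread executes. */
int efsm_iterations(int t, int m, int c, int i, int j, int *out, int cap)
{
    assert(t >= 1 && t <= m && c >= 1);
    int n = 0;
    int x = (t - 1) * c + i;   /* start: first iteration of the first chunk */
    int y = 0;
    while (x < j) {
        if (n < cap) out[n] = x;   /* loop body S runs for iteration x */
        n++;
        if (y < c - 1) {           /* stay inside the current chunk */
            x++;
            y++;
        } else {                   /* chunk done: jump to this thread's next chunk */
            x = x - y + m * c;
            y = 0;
        }
    }
    return n;
}
```

With 4 threads, chunk size 2 and the range 0..19, thread 1 executes iterations 0, 1, 8, 9, 16 and 17.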
29
C Program Model Construction (2/2)
  • To model races in a C statement
  • y = f(x1, x2, ..., xn)
  • Assume the statement reads x1, x2, ..., xn in order.
  • Other orders can also be modeled.
  • Translate to the following n+1 EFSM transitions
  • a1 = x1; a2 = x2; ...; an = xn; y = f(a1, ..., an)
  • a1, a2, ..., an are auxiliary variables in EFSM.
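Splitting a statement into separate read and write transitions is what lets a simulator expose races inside a single C statement. As a small illustration (not the tool's actual model), the sketch below runs two threads that each execute x = x + 1 as a read transition followed by a write transition; the function name and the schedule encoding are assumptions made for this example.

```c
#include <assert.h>

/* Two threads each execute x = x + 1, split into a read transition
 * (a_t = x) and a write transition (x = a_t + 1).
 * schedule[] lists which thread (0 or 1) takes the next transition. */
int run_increment_schedule(const int *schedule, int len)
{
    int x = 0;            /* shared variable */
    int a[2] = {0, 0};    /* auxiliary variables, one per thread */
    int pc[2] = {0, 0};   /* 0: about to read, 1: about to write, 2: done */
    for (int s = 0; s < len; s++) {
        int t = schedule[s];
        assert(t == 0 || t == 1);
        if (pc[t] == 0) {          /* read transition: a_t = x */
            a[t] = x;
            pc[t] = 1;
        } else if (pc[t] == 1) {   /* write transition: x = f(a_t) */
            x = a[t] + 1;
            pc[t] = 2;
        }
    }
    return x;
}
```

The schedule 0, 0, 1, 1 behaves like the sequential execution and yields 2, while 0, 1, 0, 1 lets both threads read the old value and yields 1.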

30
Relaxed Memory Models
  • Out-of-order execution of accesses to the memory
    for hardware efficiency.
  • local caches, multiprocessors
  • for customized synchronizations, controlled races
  • May lead to unexpected result.
  • A classical example
  • initially x = 0 ∧ y = 0
  • thread 1: x = 1; z = y
  • thread 2: y = 1; w = x
  • assert z = 1 ∨ w = 1

31
Relaxed Memory Models
  • A classical example
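  • initially x = 0 ∧ y = 0

[Figure: threads 1 and 2 reach memory through caches 1 and 2 via load and store operations]
thread 1: x = 1; z = y (cache 1)
thread 2: y = 1; w = x (cache 2)
Operations: x.c1 = 1; y.c1 = 1; load(w.c2, x); load(z.c1, y); store(x.c1): x = x.c1; store(y.c2): y = y.c2
assert z = 1 ∨ w = 1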
32
Relaxed Memory Models
  • Total store order (TSO)
  • From SPARC
  • Adapted to Intel 80x86 series
  • Description
  • Local reads can use pending writes in the local
    store.
  • Problem: reads by peer threads are not aware of the
    local pending writes.
  • Local stores must be FIFO.

33
Modeling TSO w. m threads (1/4)
  • An array x[0..m] for each shared variable x
  • x[0] is the memory copy.
  • x[i] is the cache copy of x of thread i ∈ [1,m]
  • x now becomes an address variable instead of the
    value variable for x.

34
Modeling TSO w. m threads (2/4)
  • An array ls[0..n] of objects for the load-store (LS)
    buffer of size n+1.
  • ls_st[k]: status of load-store buffer cell k
  • 0: not used, 1: load, 2: store
  • ls_th[k]: thread that uses load-store buffer cell
    k.
  • ls_dst[k], ls_src[k]: destination and source
    addresses
  • ls_value: value to store
  • A single shared array is used purely for convenience.
  • Can be changed to m load-store buffers, one for each
    thread.
  • This requires knowing the mapping from threads to cores.
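The buffer layout above maps naturally onto an array of records. The following sketch is one possible rendering, with field names taken from the slide; the helper functions `ls_retire` and `ls_first_idle` are hypothetical and only illustrate the "compact LS array" and "smallest idle LS object" operations mentioned in the model.

```c
#include <assert.h>

enum { LS_UNUSED = 0, LS_LOAD = 1, LS_STORE = 2 };

struct ls_cell {
    int st;      /* status: 0 not used, 1 load, 2 store */
    int th;      /* thread using the cell (0 when unused) */
    int dst;     /* destination address */
    int src;     /* source address */
    int value;   /* value to store */
};

/* Retire cell k and compact the array so that used cells stay
 * contiguous and keep their original (FIFO) order. n = number of cells. */
void ls_retire(struct ls_cell *ls, int n, int k)
{
    assert(k >= 0 && k < n);
    for (int q = k; q + 1 < n; q++)
        ls[q] = ls[q + 1];                        /* shift later cells down */
    ls[n - 1] = (struct ls_cell){0, 0, 0, 0, 0};  /* last cell becomes idle */
}

/* Index of the smallest idle cell, or -1 if the buffer is full. */
int ls_first_idle(const struct ls_cell *ls, int n)
{
    for (int q = 0; q < n; q++)
        if (ls[q].st == LS_UNUSED) return q;
    return -1;
}
```

Keeping the used cells contiguous means the smallest idle index always sits just past the newest pending operation, which preserves FIFO order for stores.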

35
Modeling TSO w. m threads (3/4)
  • Load x into a by thread j; a is private.

PW? | Step | Process | Synchronization | Actions
Pending write (PW) | 1 | Thread J | !load_at_Q | ls_src_at_(Q) = x; ls_dst = a
Pending write (PW) | 1 | LS Q (must be the largest PW LS object) | ?load_at_J | ls_th = J; ls_status = 1
Pending write (PW) | 2 | Thread J | ?load_finish |
Pending write (PW) | 2 | LS Q | !load_finish_at_(ls_th) | ls_dst[0] = ls_value; ls_th = 0; ls_status = 0; compact LS array
No pending write | 1 | Thread J | !load_at_Q | ls_src_at_(Q) = x; ls_dst = a
No pending write | 1 | LS Q (must be the smallest idle LS obj.) | ?load_at_J | ls_th = J; ls_status = 1
No pending write | 2 | Thread J | ?load_finish |
No pending write | 2 | LS Q | !load_finish_at_(ls_th) | ls_dst[0] = ls_src[0]; ls_th = 0; ls_status = 0; compact LS array
36
Modeling TSO w. m threads (4/4)
  • Store a into x by thread j; a is private.

Step | Process | Synchronization | Actions
1 | Thread J | !store_at_Q | ls_dst_at_(Q) = x; ls_value = a
1 | LS Q (must be the smallest idle LS obj.) | ?store_at_J | ls_th = J; ls_status = 2
2 | LS Q | | ls_dst[0] = ls_value; ls_th = 0; ls_status = 0; compact LS array
37
Guided Simulation
  • For each pairwise race condition truth
    assignment, perform a simulation session.
  • Use a stack to explore the simulation paths.
  • Explore all paths compatible with the truth
    assignment.
  • Check consistency at the end of each path.
  • Mark benign races.

38
Implementation
  • Pathg: path generator
  • Potential race condition solving
  • Presburger formulas solved with the Omega library
  • Model construction
  • REDLIB for EFSM with synchronizations, arrays,
    variable declarations, address arithmetic
  • Guided EFSM simulation
  • REDLIB semi-symbolic simulator
  • step, backtrack, check fixpoint/consistency

39
Implementation: Guided Symbolic Simulation
[Figure: guided multi-threaded simulation compared with sequential execution (golden model). Each side runs Master Thread → Parallel Tasks 1, 2, 3 → Master Thread → output and records a memory accessing sequence.]
Memory accessing sequences shown:
Read L[2][1], Write L[2][1], Read L[2][1], Read L[2][1], Write L[2][1], ...
Read L[2][1], Read L[2][1], Write L[2][1], Read L[2][1], Write L[2][1], ...
40
Implementation: Potential Race Report
tg indicates the threads involved in the race. tw
indicates the threads that WRITE the memory address. Race
is where the race condition is. We enumerate
variables to limit the solution.
tg: i_4, i_1; tw: i_4; Race: L[5][1]
tg: i_3, i_4; tw: i_3; Race: L[4][1]
tg: i_2, i_3; tw: i_2; Race: L[3][1]
tg: i_1, i_2; tw: i_1; Race: L[2][1]
41
Experiments
  • Environment
  • Ubuntu 9.10 64bit
  • i5-760 2.8GHz and 2GB RAM
  • Benchmarks
  • OpenMP Source Code Repository (OmpSCR)
  • NAS Parallel Benchmarks (NPB)

42
Constraint Solving of OmpSCR
  • Bug v1: Races manually introduced (between any
    two threads dealing with consecutive
    iterations)
  • Bug v2: Rare races introduced (only between two
    specific threads on a particular shared memory location)
  • Fixed: A barrier statement manually inserted
    (removes the race in Bug v2)

Const. = number of constraints, Sat = number of satisfiable constraints.

Benchmark | Original Const. | Original Sat | Original Time | Bug v1 Const. | Bug v1 Sat | Bug v1 Time | Bug v2 Const. | Bug v2 Sat | Bug v2 Time | Fixed Const. | Fixed Sat | Fixed Time
c_lu.c | 71 | 0 | 0.18s | 629 | 29 | 1.810s | 935 | 30 | 4.110s | 935 | 0 | 5.15s
c_ja01.c | 95 | 0 | 0.39s | 95 | 8 | 0.42s | 155 | 1 | 0.75s | 95 | 0 | 0.77s
c_ja02.c | 95 | 0 | 0.03s | 95 | 8 | 0.35s | 155 | 1 | 0.67s | 95 | 0 | 1.03s
c_loopA.c | 17 | 0 | 0.04s | 47 | 4 | 0.07s | 95 | 1 | 0.32s | 17 | 0 | 0.84s
c_loopB.c | 17 | 0 | 0.03s | 29 | 4 | 0.08s | 95 | 1 | 0.15s | 17 | 0 | 1.13s
c_md.c | 65 | 0 | 0.25s | 77 | 4 | 0.30s | 131 | 1 | 0.53s | 65 | 0 | 1.25s
43
Symbolic Simulation of OmpSCR
  • Blind (random) simulation needs to explore (much) more
    traces to hit a consistency violation!
  • Standard OpenMP tools fail to report races of
    these benchmarks.

Benchmark | Guided simulation: Traces | Guided simulation: Time | Random simulation: Traces | Random simulation: Time | Sun Studio: race | Intel Thread Checker: race/total
c_lu_bug1 | 1 | 23.35s | 25.3 | 52.11s | N | 4/10
c_lu_bug2 | 1 | 23.22s | 178.9 | 110.58s | N | 1/10
c_ja01_bug1 | 1 | 6.65s | 10.6 | 26.60s | N | 4/10
c_ja01_bug2 | 1 | 13.91s | 42.1 | 58.16s | N | 3/10
c_ja_02_bug1 | 1 | 14.86s | 25 | 28.83s | N | 2/10
c_ja_02_bug2 | 1 | 15.19s | 41.3 | 52.25s | N | 2/10
c_loopA_bug1 | 1 | 10.76s | 11.7 | 36.82s | N | 3/10
c_loopA_bug2 | 1 | 56.86s | 27.6 | 98.40s | N | 2/10
c_loopB_bug1 | 1 | 14.54s | 9.4 | 29.58s | N | 2/10
c_loopB_bug2 | 1 | 41.50s | 38.6 | 66.48s | N | 2/10
c_md_bug1 | 1 | 12.19s | 10.4 | 26.21s | N | 3/10
c_md_bug2 | 1 | 19.38s | 44.3 | 83.52s | N | 2/10
44
NAS Parallel Benchmarks
  • Middle-size benchmarks (1200 to 3500 loc)
  • Efficient race constraint solving
  • e.g., 150000 race constraints solved in 38
    minutes by the Omega library
  • Rare satisfiable constraints
  • 8/85067 constraints of nas_lu.c

Benchmark | loc | Access | Const. | Sat | Time
nas_lu.c | 3481 | 13736 | 85067 | 8 | 27m30.37s
bt.c | 3616 | 15916 | 157047 | 0 | 37m33.32s
mg.c | 1250 | 4636 | 2269 | 0 | 0m17.19s
sp.c | 2983 | 13604 | 45209 | 0 | 4m0.32s
45
nas_lu.c
  • Slice the program to the segment of the
    parallel region with satisfiable race
    conditions
  • Construct the symbolic model of the sliced
    segment
  • 35 modes (EFSM)
  • Reaching the fixed point without consistency
    violation after 205 steps and 16.93 s
  • Benign races
  • All of them are used as mutual exclusion
    semaphores
  • nas_lu.c is consistent

46
Conclusion
  • Static analysis of program consistency
  • for real C/C++ programs with OpenMP directives
  • Highly automated solution
  • Constraint solving
  • Symbolic simulation
  • High precision relaxed memory models
  • High efficiency
  • Extension to TBB, other memory models ?
  • Partial order reduction ?

47
Conclusion
  • Symbolic approach for static consistency checking
  • Detect and identify races by solving race
    constraints (Presburger formulas)
  • Construct symbolic models and perform guided
    simulation with races
  • Support relaxed memory models
  • Find consistency violations effectively (when
    they exist)