Title: Shared Memory Consistency Models : A broad survey Ganesh Gopalakrishnan* School of Computing, University of Utah, Salt Lake City, UT
Past work supported in part by SRC Contract
1031.001, NSF Award 0219805 and an equipment
grant from Intel Corporation
2 Shared Memory Hardware Realities
3 Shared Memory Software Realities
- Must define the formal semantics of shared-memory concurrent programming while allowing for all reasonable optimizations
- Defining the shared thread semantics for Java (the original Java book's Chapter 17 has essentially been ripped out)
- Defining the shared memory model for new languages such as Unified Parallel C (UPC) for scientific programming
- At a deeper level: must have a formal basis for automatic minimal fence insertion to make programs appear to execute sequentially consistently
4 Topics
- Motivations for strong and weak memory models
- - How it affects consistency protocol design
- - How it affects programming
- Classical memory models
- - Their power
- Fence insertion during compilation
- - Run on weak architectures but appear to run SC
- Overview of some weak architectures
- Itanium in a nutshell
- SAT-based programs that check executions against memory model specs
- - Demo of MP Execution Checker (MPEC) tool for Itanium
5 Topics
- Theoretical aspects of memory model specification
- - Specify using Traces or specify using Transducers
- Why trace-based specification can allow one to talk about unrealizable machines
- - Hence undecidability of sequential consistency is not a solved problem
- Why trace-based verification methods need to exert some care
- - Otherwise one can prove conniving machines to be SC!!
- A brief taxonomy of recent results in this area
- - Mainly Alur et al., Qadeer, Bingham et al., and Sezgin
6 Sequential Consistency: The Most Basic Memory Consistency Model
- There exists a common total order
- It respects program order
- Each read sees the latest write
Example
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1           y = 2
r1 = y          r2 = x
Under Sequential Consistency: No. Under many weak models: Yes.
7 How to Think About Sequential Consistency
[Figure: processors P1, P2, ..., Pn all attached to a single Memory]
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1           y = 2
r1 = y          r2 = x
No! Not under SC! But possible under many weak memory models! An example of such a weak memory model is Sparc TSO.
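The claim that r1 = r2 = 0 is impossible under SC can be checked by brute force. The following is a minimal Python sketch, assuming the example reads as Thread 1: x = 1; r1 = y and Thread 2: y = 2; r2 = x. The tuple encoding of operations and the helper names (`interleavings`, `run`, `sc_outcomes`) are illustrative choices, not part of the original slides.

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that keeps each list's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

# ("st", var, value) stores a value; ("ld", reg, var) loads var into reg.
THREAD1 = [("st", "x", 1), ("ld", "r1", "y")]
THREAD2 = [("st", "y", 2), ("ld", "r2", "x")]

def run(schedule):
    """Execute one interleaving against a single shared memory (the SC picture)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op in schedule:
        if op[0] == "st":
            mem[op[1]] = op[2]
        else:
            regs[op[1]] = mem[op[2]]
    return regs["r1"], regs["r2"]

def sc_outcomes():
    """All (r1, r2) pairs reachable under sequential consistency."""
    return {run(s) for s in interleavings(THREAD1, THREAD2)}

if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

Every one of the six interleavings runs at least one store before the second load, so the pair (0, 0) never shows up in the result set.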
8 Coherence: Per-location Sequential Consistency
[Figure: processors P1, P2, ..., Pn attached to a 1-address Memory]
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1           y = 2
r1 = y          r2 = x
Notice that the same execution is Coherent!
9 Memory Consistency Models
Defines the legal orderings of memory operations that can be perceived at the user level
- Processors intermittently throw colors onto memory cells and also intermittently look at their colors
[Figure: processors P1, P2, ..., Pi, ..., Pn and Memory Cell 1, Memory Cell 2, ..., Memory Cell n]
10 Memory Consistency Models
Defines the legal orderings of memory operations that can be perceived at the user level
- Many have been developed
- Sequential Consistency (SC)
- Coherence (per-location SC)
- Pipelined RAM (PRAM)
- Causal Consistency
- Processor Consistency (PC)
- Release Consistency
- Location Consistency
- The Intel Itanium Memory Model
- Java Memory Model (JMM)
- and more!
11 Memory Consistency Model Specifications
A VERY complex specification for a real architecture (e.g. Itanium, PowerPC, ...)
Also of growing concern in software (e.g. Java Memory Model, Unified Parallel C model, ...)
12 Motivation for (weak) Memory Consistency Models: A Hardware Perspective
- Cannot afford to do industrious updates across large MP systems
- Delayed and re-orderable updates allow considerable latitude in memory consistency protocol design => fewer bugs in protocols!!
[Figure: hierarchy of chip-level, intra-cluster, and inter-cluster protocols, with directories (dir) and memories (mem)]
13 Price Paid for Delayed Updates: Bugs!
- Algorithms such as Peterson's Mutual Exclusion cease to work!
Thread 1                                  Thread 2
------------                              -----------
Flags1 = BUSY                             Flags2 = BUSY
Turn = 2                                  Turn = 1
While (Flags2 == BUSY && Turn != 1) ;     While (Flags1 == BUSY && Turn != 2) ;
Critical section                          Critical section
Flags1 = FREE                             Flags2 = FREE
In each thread, the While test CAN READ AN OLD VALUE!!
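One way to see how a stale read breaks Peterson's algorithm is to explore a toy store-buffer machine. The sketch below is a deliberately simplified Python model, not a specification of any real architecture: each thread has a FIFO store buffer that drains to memory at arbitrary times, and only the first three accesses of each thread's entry protocol are modelled. All names (`PROGRAMS`, `successors`, `flag_reads`) are hypothetical.

```python
from collections import deque

FREE, BUSY = 0, 1

# Entry protocol reduced to: set own flag, set turn, read the other flag.
PROGRAMS = (
    (("st", "flag1", BUSY), ("st", "turn", 2), ("ld", "flag2")),
    (("st", "flag2", BUSY), ("st", "turn", 1), ("ld", "flag1")),
)

def replace(tup, i, value):
    """Return a copy of tuple tup with position i set to value."""
    return tup[:i] + (value,) + tup[i + 1:]

def write(mem, var, value):
    """Memory is a sorted tuple of (var, value) pairs so that states are hashable."""
    return tuple(sorted({**dict(mem), var: value}.items()))

def successors(state, buffered):
    pcs, bufs, mem, seen = state
    for t in range(2):
        if pcs[t] < len(PROGRAMS[t]):          # thread t executes its next instruction
            op = PROGRAMS[t][pcs[t]]
            pcs2 = replace(pcs, t, pcs[t] + 1)
            if op[0] == "st" and buffered:     # store goes into the private buffer
                yield pcs2, replace(bufs, t, bufs[t] + ((op[1], op[2]),)), mem, seen
            elif op[0] == "st":                # no buffer: store hits memory at once
                yield pcs2, bufs, write(mem, op[1], op[2]), seen
            else:                              # load: own buffer first, then memory
                value = dict(mem)[op[1]]
                for var, v in bufs[t]:         # later buffer entries are newer
                    if var == op[1]:
                        value = v
                yield pcs2, bufs, mem, replace(seen, t, value)
        if bufs[t]:                            # oldest buffered store drains to memory
            var, v = bufs[t][0]
            yield pcs, replace(bufs, t, bufs[t][1:]), write(mem, var, v), seen

def flag_reads(buffered):
    """All pairs (value Thread 1 read for flag2, value Thread 2 read for flag1)."""
    mem0 = write((), "flag1", FREE)
    mem0 = write(write(mem0, "flag2", FREE), "turn", 1)
    start = ((0, 0), ((), ()), mem0, (None, None))
    done = tuple(len(p) for p in PROGRAMS)
    visited, todo, results = {start}, deque([start]), set()
    while todo:
        state = todo.popleft()
        if state[0] == done:
            results.add(state[3])
        for nxt in successors(state, buffered):
            if nxt not in visited:
                visited.add(nxt)
                todo.append(nxt)
    return results
```

With `buffered=True` the pair (FREE, FREE) is reachable, meaning both threads skip their While loop and enter the critical section together; with `buffered=False` (an SC-like machine) it is not.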
14 Scope of Tutorial
- Survey of Classical Work
- Survey of Current Activities (that this speaker is aware of)
- Verification Challenges
- Theoretical Questions
- Justification for topic selection
- Complement talks on Shared Memory Consistency Protocols
- Intuitions more important than the detailzzz.
- Knowing who's who in this area helps
- Excuse for me to stick my neck out and learn something new
15 Organization
- Overview (mainly of classical works)
- Practical aspects of weak consistency models (more depth)
- What's not apparent at first glance (still more depth)
- Conclusions and references
16 Part 1: Overview of Classical Work
17 Memory Serves to Plumb Data
Uniprocessor: Write (address 2, data 33) ... Read (address 2, returns data 33)
Multiprocessor:
P1              P2
----            ----
Write (2, 33)   Read (2, 33)
?               ?
... but respecting Coherence!
Multiprocessor:
P1              P2
----            ----
Write(2, 33)    Write(2, 77)
Read (2, 77)    Read(2, 33)
P1              P2              P3              P4
----            ----            ----            ----
Write (2, 33)   Write (2, 77)   Read(2, 33)     Read(2, 77)
                                Read(2, 77)     Read(2, 33)
?               ?
18 ... but Coherence is not sufficient
From Shasha and Snir, Figure 1, p. 282 (ACM TOPLAS 10(2), 1988)
Processor 1             Processor 2
-------------           --------------
Test_and_set1(LOCK)     Test_and_set2(LOCK)
Read1(X)                Read2(X)
Write1(X)               Write2(X)
Reset1(LOCK)            Reset2(LOCK)
The following memory access sequence respects Coherence but breaks the critical section:
Test_and_set1(LOCK) Read1(X) Reset1(LOCK) Test_and_set2(LOCK) Read2(X) Write1(X) Write2(X) Reset2(LOCK)
- A consistent view ACROSS THE ADDRESS SPACE is needed
- The most intuitive such view: Sequential Consistency!
19 Basic understanding of SC
- Execute AS IF instructions in each thread were executed sequentially and atomically
- - respecting the program order in each thread
- - no constraints across sequential programs
Requires effort to achieve the above effect AS WELL AS high performance
One CPU:      Write (4, 66) MISSES; Read (2, 22) HITS
Another CPU:  Write (2, 55) MISSES; Read (4, 11) HITS
Which Read waits?
[Figure: CPU 1 ... CPU n attached to a Memory and Bus Controller]
20 Aggressive SC Implementations
From Adve, Pai, and Ranganathan (Proc. IEEE 87(3), March 1999, p. 448): If the accessed location does not change its value until the Read could have been non-speculatively issued, then the speculation is successful. Otherwise, roll back the speculation up to the incorrect load. (Similar schemes are used in HP PA-8000, Intel Pentium Pro, MIPS R10K.)
One CPU:      Write (4, 66) MISSES; Read (2, 22) HITS
Another CPU:  Write (2, 55) MISSES; Read (4, 11) HITS
Snoops seen by each CPU are Write(4,66) and Write(2,55)
[Figure: CPU 1 ... CPU n attached to a Memory and Bus Controller]
One way to implement this: if the bus-snoop for Write(4,..) arrives before that for Write(2,..), the Read(4, 11) is invalidated and it reissues.
21 Unexpected Interactions: SC and Write-Update Protocols (from Grahn, Stenstrom, Dubois)
- An important aspect of Sequential Consistency is Write Atomicity
- Write-Invalidate protocols can easily guarantee Write Atomicity
- However, Write-Update protocols are often recommended (Read-latency)
- Ensuring Write Atomicity in Write-Update protocols is tricky
- WEAK MEMORY MODELS TO THE RESCUE!
- Don't care about Write Atomicity except at Acquire / Release points
[Figure: hierarchy of chip-level, intra-cluster, and inter-cluster protocols, with directories (dir) and memories (mem)]
22 A Deeper Look at Coherence
Complexity of checking coherence of executions is in NPC
Cantin's proof: reduction from SAT. Existence of a coherent schedule is tested.
Example: consider (u1 \/ u2) /\ (u1 \/ u2) and create the following concurrent processes
h1        h2        h_u1      h_u1      h_u2      h_u2      h3
---       ---       -----     -------   -----     -------   ---
W(d_u1)   W(d_u1)   R(d_u1)   R(d_u1)   R(d_u2)   R(d_u2)   R(d_c1)
W(d_u2)   W(d_u2)   R(d_u1)   R(d_u1)   R(d_u2)   R(d_u2)   R(d_c2)
W(d_c1) W(d_c2)
W(d_c1) W(d_u1)
W(d_c2) W(d_u2)
W(d_u1)
W(d_u2)
W(d_F)
[Figure labels: Literal Gadget, Clause Gadget]
23 A Deeper Look at Coherence
- Memory models that relax coherence and how useful they are
- PRAM (pipelined RAM, Lipton and Sandberg) is of academic interest
[Figure: processors P1, P2, ..., Pn, one memory per processor]
Program order is obeyed, but there is no Write-Atomicity
24 A Deeper Look at Coherence
- Memory models that relax coherence and how useful they are
- PRAM is of academic interest
- Location consistency
- Proposed by Gao and Sarkar
- They tout its advantages in terms of scalability
- They describe an LC protocol machine
- Analysis by Wallace et al. (PDPTA 2002, pp. 1542-1550)
- Shown that this LC machine is stronger than the LC definition
- Question whether LC programs indeed appear to execute with sequentially consistent outcomes, assuming that they are properly labeled
- I have not seen many pubs on LC of late
25 Classical Weak Memory Models
- Processor Consistency is widely known
- Good discussions in Ahamad et al., "The Power of Processor Consistency"
- First understand PRAM
- - For each processor p, there is a legal serialization $S_p$ of $H_{p+w}$ such that if $o_1$ and $o_2$ are in $H_{p+w}$ and $o_1 \xrightarrow{po} o_2$, then $o_1 \xrightarrow{S_p} o_2$
- For PC_g, we add the following condition
- - for any two processors p and q, and for any location x, $S_p|(w,x) = S_q|(w,x)$
- Processor Consistency according to Goodman (PC_g) is not the same as PC_d, processor consistency according to the DASH project
26 Execution that's PRAM and Coherent but not PC_g
P: w(x,0) w(y,0)
Q: r(y,0) w(x,1)
R: r(x,1) r(x,0)
Coherent! Just look at each color separately.
Not PC_g: construct a history per processor with all of that processor's actions and all of the others' writes. PC_g requires the write-histories to agree per variable, but in our example:
History of Q: w(x,0) w(x,1), while
History of R: w(x,1) w(x,0)
27 The Power of Processor Consistency
- Can handle Peterson (Ahamad)
- Can't handle Bakery (Ahamad)
- What else? (Kawash and Higham, "Bounds for mutual exclusion with only Processor Consistency")
- - Peterson is correct for PC-G (a multi-writer protocol)
- - Bakery is incorrect for PC-G (a single-writer protocol)
- - Kawash and Higham prove that for mutual exclusion under PC-G, one multi-writer and n single-writers are necessary
28 Observations
- Weak shared memory consistency models allow consistency protocols to be efficient
- Unfortunately programmers find weak models non-intuitive
- How can we have the best of both worlds?
- - weak models to be supported by the hardware
- - strong models to be presented by the software
- This can be achieved through compilers that insert the minimal number of fence instructions to give the appearance of SC
29 Basics of Fence Insertion
- Widely cited work is by Shasha and Snir
- Recent work by Lee, Midkiff, and Padua extends the above
- Let us go through some examples (initially all mem. locations are 0)
P1              P2
----            ----
write(x,1)      read(y, yd)
write(y,1)      read(x, xd)
Under SC, if yd = 1, then xd = 1
30 Basics of Fence Insertion
P1              P2
----            ----
write(x,1)      read(y, yd)
write(y,1)      read(x, xd)
- BUT if we allow instructions to re-order, then the guarantee "if yd = 1, then xd = 1" is lost!!
- But often we CAN re-order without noticing an SC violation
- When can we re-order??
31 Basics of Fence Insertion
- Widely cited work is by Shasha and Snir (our examples are from their paper)
- Recent work by Lee, Midkiff, and Padua extends the above
- Let us go through some examples (initially all mem. locations are 0)
P1              P2
----            ----
write(x,1)      read(y, yd)
write(y,1)      read(x, xd)
(a and b label the two program-order edges, one per thread)
- Which program order edges in P = {a, b} must be respected in order to guarantee SC-compliant executions?
- Preserving a alone: insufficient, as it can return xd = 0, yd = 1
- Preserving b alone: insufficient, as it can return xd = 0, yd = 1
- BOTH a and b need to be preserved; how to compute this in general?
- Terminology: {a, b} in this example forms the Delay Set, D
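The three bullet claims about preserving a, b, or both can be checked by enumeration. Here is a minimal Python sketch, assuming edge a is the write(x,1) before write(y,1) order in P1 and edge b is the read(y) before read(x) order in P2 (the slide does not say which label belongs to which thread, but the claims are symmetric). A relaxed edge is modelled simply as "either order may be executed"; the function names are illustrative.

```python
P1 = [("w", "x", 1), ("w", "y", 1)]          # edge a: write(x,1) before write(y,1)
P2 = [("r", "y", "yd"), ("r", "x", "xd")]    # edge b: read(y) before read(x)

def orders(thread, keep_edge):
    """Program order only if the edge is preserved; both orders if it is relaxed."""
    return [thread] if keep_edge else [thread, thread[::-1]]

def merges(a, b):
    """Yield every interleaving of lists a and b that keeps each list's order."""
    if not a or not b:
        yield a + b
        return
    for rest in merges(a[1:], b):
        yield a[:1] + rest
    for rest in merges(a, b[1:]):
        yield b[:1] + rest

def outcomes(keep_a, keep_b):
    """Set of reachable (xd, yd) pairs, with all locations initially 0."""
    result = set()
    for t1 in orders(P1, keep_a):
        for t2 in orders(P2, keep_b):
            for sched in merges(t1, t2):
                mem, regs = {"x": 0, "y": 0}, {}
                for kind, var, arg in sched:
                    if kind == "w":
                        mem[var] = arg
                    else:
                        regs[arg] = mem[var]
                result.add((regs["xd"], regs["yd"]))
    return result
```

The SC-violating pair (xd, yd) = (0, 1) appears in `outcomes(True, False)` and in `outcomes(False, True)` but not in `outcomes(True, True)`, which is the sense in which both edges belong to the delay set.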
32 Analysis is based on Critical Cycles
- Locate all critical cycles in the concurrent program
- Equate Delay Set D to all the program-order edges in all critical cycles
- Locating Critical Cycles
- - Locate all Conflict Edges C: locate two accesses to the same variable that are concurrent and one of them is a write; these give the undirected Conflict Edges C
- - A critical cycle is a cycle in P U C that has the following properties
- - - Contains at most two operations from the same thread, and these are consecutive in it
- - - Contains 0, 2, or 3 accesses to each shared variable, and these are consecutive in it (further properties omitted)
33 Finding Critical Cycles: Example 1
P1              P2
----            ----
write(x,1)      read(y, yd)
write(y,1)      read(x, xd)
Program Order Edges P: write(x,1) -> write(y,1) and read(y, yd) -> read(x, xd)
Conflict Edges C: write(x,1) -- read(x, xd) and write(y,1) -- read(y, yd)
Critical Cycle: write(x,1), write(y,1), read(y, yd), read(x, xd), and back to write(x,1)
Delay Set D = all the P edges in the Critical Cycle = P in our case
34 Finding Critical Cycles: Example 2
P1              P2
----            ----
read(x, xd)     write(x,1)
read(y, yd)     write(y,1)
(Basically a while loop)
[Figure: the same program annotated with its Conflict Edges, the Critical Cycle, and program-order edges a, b, c]
Delay Set D = {b, c}, whereas P = {a, b, c}
35Finding Critical Cycles Example 3
a1 read A b1 read B c1 read C d1
read D
a2 write B b2 write C c2 write D d2
write A
D (a1,b1), (a1,c1), (a1,d1), (a2,d2),
(b2,d2), (c2,d2) suffices to ensure SC
! I.e., a1 is an acquire-read and d2 is a
release-write !!
36 Basic Approach to Fence Insertion
- Goal: discover the minimal set of fences to be inserted into a concurrent shared memory program
- Suppose D is the delay set discovered by the previous analysis
- Suppose the underlying (weak) architecture supports orderings $D_o$
- Let $D_m$ be the fences to be inserted to get the effect of D
- $D_m = (D \cup D_o)^{tr} - D_o$, where tr is the transitive reduction
Example on four operations a, b, c, d:
- Required Delay Set D = {(a,b), (b,c), (a,d)}
- $D_o$ = {(c,d)}
- $(D \cup D_o)^{tr}$ = {(a,b), (b,c), (c,d)}
- $(D \cup D_o)^{tr} - D_o$ = {(a,b), (b,c)}: fences needed only here
37 Basic Approach to Fence Insertion
- Required Delay Set D = {(a,b), (b,c), (a,d)}
- $D_o$ = {(c,d)}
- $(D \cup D_o)^{tr}$ = {(a,b), (b,c), (c,d)}
- $(D \cup D_o)^{tr} - D_o$ = {(a,b), (b,c)}: fences needed only here
So, in a nutshell: the program a; fence; b; fence; c; d, where c before d is a hardware-provided ordering, implements the desired delay set on a, b, c, d.
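The computation of the fence set on this slide can be reproduced in a few lines. This is a minimal Python sketch, assuming the combined relation is acyclic (so that its transitive reduction is a subset of its edges); the function names `transitive_closure`, `transitive_reduction`, and `fences_needed` are illustrative.

```python
def transitive_closure(edges):
    """Smallest transitive relation containing edges (simple fixpoint iteration)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def transitive_reduction(edges):
    """For an acyclic relation: drop every edge that is implied by a longer path."""
    closure = transitive_closure(edges)
    nodes = {n for edge in edges for n in edge}
    return {
        (a, b)
        for (a, b) in edges
        if not any((a, m) in closure and (m, b) in closure
                   for m in nodes if m not in (a, b))
    }

def fences_needed(delay_set, hardware_order):
    """D_m = (D U D_o)^tr - D_o, as on the slide."""
    return transitive_reduction(delay_set | hardware_order) - hardware_order

if __name__ == "__main__":
    D = {("a", "b"), ("b", "c"), ("a", "d")}
    D_o = {("c", "d")}
    print(sorted(fences_needed(D, D_o)))
```

On the slide's example the edge (a,d) is dropped because a reaches d through b and c, and (c,d) is dropped because the hardware already provides it, leaving fences only for (a,b) and (b,c).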
38 Deriving Fences from Correctness Proofs
Lamport's paper "How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor", IEEE Trans. Computers 46(7), 1997, provides really good insight on deriving required weak orderings thru proofs
- Notations
- A -> B: every event in A precedes every event in B
- A --> B: some event in A precedes some event in B
[Figure: two inference rules, each marked "Implies"]
39 Deriving Places to insert a Synch Instruction
There is a proof in Lamport's paper that with just these Synch instructions, mutual exclusion is guaranteed.
Repeat forever
  noncritical section
  L: x_i := true
  For j := 1 until i-1 Do
    if x_j then
      x_i := false
      while x_j do od
      goto L
    fi
  oD
  For j := i+1 until N do
    while x_j do od
  od
  critical section
  x_i := false
End Repeat
[Figure: three Synch instructions are marked on this program]
40 Part 2: A Detailed Look at a Practical Weak Memory Model: Itanium (I do mention three others briefly)
41 Well, let's look at the big picture first
- Sparc TSO, PSO, RMO
- Reads and Writes follow the TSO, PSO, or RMO semantics
- Additional Fence instructions
- and others (e.g. semaphores)
- I'm not up to speed on these
- Alpha
- Reads (only coherence)
- Writes (only coherence)
- Load-Locked
- Store-Conditional
- Membar
42 Well, let's look at the big picture
- Power-4
- Reads and Writes (don't know much)
- Sync (Synchronize)
- Lwsync (Lightweight Sync, new in Power4)
- EIEIO (Enforce In-Order Execution of I/O)
- Lwarx (Load word and reserve)
- Ldarx (Load doubleword and reserve)
- Stwcx (Store word conditional)
- Stdcx (Store doubleword conditional)
- Isync (Instruction synchronize)
Perhaps Old McDonald knows more
43 IA-32, IA-64, AMD, ?
- Generally thought to be Processor Consistency
- Does it really help to formally specify (or even reveal the details)?
- Intel thought so
- The Itanium memory model is described next
44 The Intel Itanium Processor memory model
- Has these kinds of instructions
- - weak load or ordinary load: ld
- - strong load or acquire-load: ld.acq
- - weak store or ordinary store: st
- - strong store or release store: st.rel
- - memory fence (NOT barrier!): mf
- - A few semaphore-types
- - Allows sub-word writes, I/O spaces
We don't model these
45 Itanium memory model thru examples
Ordinary store: st x 2
- Can freely slide in a sequential program
- Only rule is coherence
The same applies to an ordinary load: ld reg1 x
46 Itanium memory model thru examples
Release store: st.rel x 2
- Things before it in sequential program order can't happen after it
- Things after it in sequential program order may happen before it!!
47 Itanium memory model thru examples
Acquire load: ld.acq r3 y
- Things before it in sequential program order may happen after it
- Things after it in sequential program order can't happen before it!!
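The sliding rules on the last three slides can be condensed into one predicate over a pair of instructions in program order. This is only a sketch in Python: instructions are represented by their mnemonic strings, coherence (same-address) and data-dependence constraints are ignored, and the treatment of mf as ordering everything across it is an assumption that these slides do not state. The function name is hypothetical.

```python
def must_stay_ordered(first, second):
    """True if `first` (earlier in program order) may not slide past `second`.

    Mnemonics: "ld", "st", "ld.acq", "st.rel", "mf".
    """
    if "mf" in (first, second):
        return True            # assumption: a fence orders everything across it
    if first == "ld.acq":
        return True            # things after an acquire load can't happen before it
    if second == "st.rel":
        return True            # things before a release store can't happen after it
    return False               # ordinary loads and stores slide freely

if __name__ == "__main__":
    for pair in [("st", "st.rel"), ("st.rel", "ld"), ("ld.acq", "st"), ("ld", "ld.acq")]:
        print(pair, must_stay_ordered(*pair))
```

Note that under these rules a release store followed by an acquire load is not ordered, which is one reason the outcome on the next slide needs more than instruction shuffling to explain.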
48 But with these rules alone, we can't explain the following legal outcome in Itanium
One thread:    st.rel y 1;  ld.acq r3 y <1>;  ld reg1 x <0>
Other thread:  st.rel x 2;  ld.acq r4 x <2>;  ld reg2 y <0>
In each thread, the st.rel is ordered before the ld.acq (data dep.), and the ld.acq is ordered before the ordinary ld (ld.acq rule).
Itanium specification DOES NOT try to explain outcomes in terms of shuffles of the original instructions!
49 Itanium rules explain execution outcomes in terms of "progenies" of stores and loads
This has turned out to be an unspoken convention in this area for other memory models also
- A store generates (n+1) progenies
- Other instructions generate only one
Example: st y 1 generates a local copy for P0, a remote copy for P0, and a remote copy for P1; ld.acq r3 y generates a single tuple
50 We wrote such a "breeding assembler"
P1: St a,1;  Ld r1,a <1>;  St b,r1 <1>
P2: Ld.acq r2,b <1>;  Ld r3,a <0>
Tuple 1 ... Tuple 9:
id  proc  pc  op     var  data  wrID  wrType    wrProc  reg  useReg
0   0     0   St     0    1     0     Local     0       -1   false
1   0     0   St     0    1     0     Remote    0       -1   false
2   0     0   St     0    1     0     Remote    1       -1   false
3   0     1   Ld     0    1     -1    DontCare  -1      0    true
4   0     2   St     1    1     4     Local     0       0    true
5   0     2   St     1    1     4     Remote    0       0    true
6   0     2   St     1    1     4     Remote    1       0    true
7   1     0   LdAcq  1    1     -1    DontCare  -1      1    true
8   1     1   Ld     0    0     -1    DontCare  -1      2    true
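The expansion of instructions into tuples can be sketched as follows. This is not the authors' assembler, only a minimal Python illustration that reproduces the field names of the nine tuples listed on this slide; the input encoding (op, var, data, reg) and the function name `breed` are assumptions made for the example.

```python
def breed(threads):
    """Expand each instruction into its tuples.

    threads: one list of instructions per processor.
    Each instruction is (op, var, data, reg), with reg = -1 if no register is used.
    A store yields one Local copy plus one Remote copy per processor (n + 1 tuples);
    every other instruction yields a single tuple.
    """
    nprocs = len(threads)
    tuples = []
    for proc, instrs in enumerate(threads):
        for pc, (op, var, data, reg) in enumerate(instrs):
            base = dict(proc=proc, pc=pc, op=op, var=var, data=data,
                        reg=reg, useReg=reg >= 0)
            if op.startswith("St"):
                wr_id = len(tuples)   # the id of the local copy names the family
                copies = [("Local", proc)] + [("Remote", p) for p in range(nprocs)]
                for wr_type, wr_proc in copies:
                    tuples.append(dict(base, id=len(tuples), wrID=wr_id,
                                       wrType=wr_type, wrProc=wr_proc))
            else:
                tuples.append(dict(base, id=len(tuples), wrID=-1,
                                   wrType="DontCare", wrProc=-1))
    return tuples

if __name__ == "__main__":
    # Variables a, b are numbered 0, 1; registers r1, r2, r3 are numbered 0, 1, 2.
    program = [
        [("St", 0, 1, -1), ("Ld", 0, 1, 0), ("St", 1, 1, 0)],
        [("LdAcq", 1, 1, 1), ("Ld", 0, 0, 2)],
    ]
    for t in breed(program):
        print(t)
```

For the two-processor program above, the two stores contribute three tuples each and the three loads one each, giving the nine tuples shown.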
51 Itanium rules specify how to "line up" the tuples to explain the load outcomes!!
P0: st y 1;  ld.acq r3 y <1>;  ld reg1 x <0>
P1: st x 2;  ld.acq r4 x <2>;  ld reg2 y <0>
Split copies of the stores: st y 1 (l), st y 1 (rp0), st y 1 (rp1), st x 2 (l), st x 2 (rp0), st x 2 (rp1)
Now, arrange the split copies. Explanation:
- Dependencies: st y 1 (l) before ld.acq r3 y <1>; st x 2 (l) before ld.acq r4 x <2>
- Anti-dependencies: ld reg1 x <0> before st x 2 (rp0); ld reg2 y <0> before st y 1 (rp1)
- Remaining copies: st y 1 (rp0), st x 2 (rp1)
52 Gist of our method: Illustration on SC and on Itanium
The tuples to be ordered are the input in both cases.
legalItanium(exec) = Exists order. (
    requireStrictTotalOrder exec order
 /\ requireWriteOperationOrder exec order
 /\ requireItProgramOrder exec order
 /\ requireMemoryDataDependence exec order
 /\ requireDataFlowDependence exec order
 /\ requireCoherence exec order
 /\ requireAtomicWBRelease exec order
 /\ requireSequentialUC exec order
 /\ requireNoUCBypass exec order
 /\ requireReadValue exec order )
Find an arrangement as per the above constraints
SC(exec) = Exists order. (
    requireStrictTotalOrder exec order
 /\ requireProgramOrder exec order
 /\ requireReadValue exec order )
Find an arrangement under SC constraints
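The SC(exec) formula can be read directly as a search problem: find a strict total order of the events that respects program order and makes every read return the latest write. The sketch below does this by brute force over permutations in Python, which is only practical for litmus-sized executions and is not how the SAT-based tool works. The event encoding and function names are assumptions made for this example.

```python
from itertools import permutations

# An event is (proc, pc, kind, var, value): kind is "W" or "R",
# value is the value written or the value the read was observed to return.

def reads_see_latest_write(order, initial):
    """requireReadValue: each read returns the most recent write in the order."""
    mem = {}
    for _, _, kind, var, value in order:
        if kind == "W":
            mem[var] = value
        elif mem.get(var, initial) != value:
            return False
    return True

def respects_program_order(order):
    """requireProgramOrder: events of one processor appear in pc order."""
    pos = {event: i for i, event in enumerate(order)}
    return all(pos[a] < pos[b]
               for a in order for b in order
               if a[0] == b[0] and a[1] < b[1])

def is_sc(events, initial=0):
    """Exists order. strict total order /\\ program order /\\ read value."""
    return any(respects_program_order(order) and reads_see_latest_write(order, initial)
               for order in permutations(events))

if __name__ == "__main__":
    store_buffering = [
        (1, 0, "W", "x", 1), (1, 1, "R", "y", 0),
        (2, 0, "W", "y", 2), (2, 1, "R", "x", 0),
    ]
    print(is_sc(store_buffering))
```

Each permutation is a candidate strict total order, so requireStrictTotalOrder holds by construction; the other two conjuncts are the two helper functions.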
53 Our Itanium Formal Model (extracted from Intel documents, written as a HOL theory)
legal_itanium exec ( a given execution ) =
?order. requireStrictTotalOrder exec order
/\ requireWriteOperationOrder exec order
/\ requireProgramOrder exec order
/\ requireMemoryDataDependence exec order
/\ requireDataFlowDependence exec order
/\ requireCoherence exec order
/\ requireReadValue exec order
/\ requireAtomicWBRelease exec order
/\ requireSequentialUC exec order
/\ requireNoUCBypass exec order
See Charme03, IPDPS04, CAV04. Various contributions by Yue Yang, Gopalakrishnan, Lindstrom, Slind, Sivaraj, Yu Yang
54 requireStrictTotalOrder exec order
55 requireWriteOperationOrder exec order
Local Write before Local Global Write; Local Write before Remote Global Writes
56 requireProgramOrder exec order
Program Order is defined solely through Acquires, Releases, and Fences
57 requireMemoryDataDependence exec order
Order two accesses (Read or Write) under these conditions: IF program-ordered AND the same variable AND
- Write is local and RAW (and Read of course is local), OR
- Write is local and WAR, OR
- Both writes are local and WAW, OR
- Both writes are remote and WAW and fall in the same processor
58 requireDataFlowDependence exec order
Data dependence thru the register space
59 requireCoherence exec order
Just plain-old Coherence, but for TWO WRITES falling in the WB or UC space and for EITHER two Local Writes OR two Remote Writes in the same processor
60 requireReadValue exec order
Reads return Most Recent Writes
61 requireAtomicWBRelease exec order
All Remote Events stemming from the same Release-Write instruction appear to be an Atomic Set
62 requireSequentialUC exec order
In the UC space, program-ordered UC Read and Write events, both of which are Local, are ordered as per program order (the two operations in question could be RR, RW, WR, or WW)
63 requireNoUCBypass exec order
UC-space operations do not exhibit Read Bypassing as in TSO
64 A MEMORY MODEL RULE IN HOL
requireCoherence exec order =
  !i j. i IN exec /\ j IN exec ==>
    isWr i /\ isWr j /\ (i.var = j.var) /\ order i j /\
    ((attr_of i.var = WB) \/ (attr_of i.var = UC)) /\
    ((i.wrType = Local) /\ (j.wrType = Local) /\ (i.proc = j.proc) \/
     (i.wrType = Remote) /\ (j.wrType = Remote) /\ (i.wrProc = j.wrProc))
    ==>
    !p q. p IN exec /\ q IN exec ==>
      isWr p /\ isWr q /\
      (p.wrID = i.wrID) /\ (q.wrID = j.wrID) /\
      (p.wrType = Remote) /\ (q.wrType = Remote) /\ (p.wrProc = q.wrProc)
      ==> order p q
65 One use we have put our Spec to: Post-Si Verification of MP Systems
How do we know that the actual silicon matches the shared memory model?
[Figure: the actual silicon set against the quantified formulas of the memory-model spec over exec]
- Pray
- Run tests and manually check results
- What else?
66 FORMALLY VERIFY interesting EXECUTIONS
P1's exec:
st8 12ca20 7f869af546f2f14c
ld8 r25 45180 <87b5e547172644a8>
ld2 r26 2c2a2c <44a8>
ld2 r27 45aa2a <c58e>
P2's exec:
st8 45180 87b5e547172644a8
ld8 r25 45180 <87b5e547172644a8>
st2 2c2a2c 44a8
st2 45aa2a c58e
67 TWO APPROACHES: explicitly QB, implicitly QB
- Explicit: Given Execution + SPEC OF MEMORY MODEL IN HOL -> BOOLIFY -> QBF (prototyped this, but definitely need to re-code this)
- Implicit: SPEC OF MEMORY MODEL IN HOL -> CONVERT TO EXECUTION CHECKER PROGRAM -> PROGRAM; PROGRAM + Given Execution -> SAT PROBLEM
68 The alternative is to produce a manual proof
Even this simple Litmus Test has a 1-page detailed proof
P: st x 1;  mf;  ld r1 y <0>
Q: st.rel y 1
R: ld.acq r2 y <1>;  ld r3 x <0>
Facts used in the proof:
- Atomicity of st.rel
- Load of initial value is before store of every other value
69 The MPEC Tool Flow
Inputs:
- MP execution to be verified, e.g. P: st x 1; mf; ld r1 y <0>   Q: st.rel y 1   R: ld.acq r2 y <1>; ld r3 x <0>
- Itanium Ordering rules in HOL
Flow:
- Mechanical Program Derivation (to be automated) -> Checker Program
- Checker Program -> Satisfiability Problem with clauses carrying annotations
- Sat Solver
- Sat: explanation in the form of one possible interleaving
- Unsat: Unsat Core Extraction using Zcore (RECENT WORK)
- - Find offending clauses
- - Trace their annotations
- - Determine ordering cycle
70 Largest example tried to date (courtesy S. Zeisset, Intel)
Proc 1: st8 12ca20 7f869af546f2f14c;  ld r25 45180 <87b5e547172644a8>;  ... 58 more instructions ...;  st2 7c2a00 4bca
Proc 2: ld4 r24 733a74 <415e304>;  st4.rel 175984 96ab4e1f;  ... 67 more instructions ...;  ld8 r87 56460 <b5c113d7ce4783b1>
- Initially the tool gave a trivial violation
- Diagnosed to be forgotten memory initialization
- Added method to incorporate memory initialization in our tool
- Our tool found the exact same cycle as pointed out by the author of the test
Cycle found thru our tool: st.rel (line 18, P1) -> ld (line 22, P2) -> mf -> ld (line 30, P2) -> st (line 11, P1)
71 Statistics Pertaining to Case Study
- 140 total instructions
- All runs were on a 1.733 GHz, 1 GB, Redhat Linux V9 Athlon
- 1 minute to generate the Sat instance
- 9M clauses ( O(n^3) in terms of instructions )
- 117,823 variables ( not a problem )
- 1 minute to run Sat (unsat here); 0.2 sec to do real work
- Zcore runs fast; gave 23 clauses in one iteration
72 Overview of MPEC
- Example of how a HOL rule was turned into a SAT generator
- How the SAT part was done: "Throwing an efficient transitivity blanket over a problem to cover it with whatever transitivity it begs for!!"
- What more to expect
- Related work
73 Gist of constraints
- Some arrangements are statically known
- Some are conditional [Figure: orderings connected by "Implies" and "and"]
- Some must form an atomic set: everybody else strictly before or strictly after
- Find a strict total order satisfying all the above!
74 Gist of constraint ENCODING
- Use a Boolean precedence matrix (N x N, rows i and columns j)
- Capture "i before j" by m_ij
- Statically known -> Unit clauses
- Implies / and -> Boolean formula
- Atomic set -> See how SAT-generator is derived
- Strict total order -> Spew out irreflexivity and totality axioms, then throw a transitivity blanket on top of all tuples
75 Other Approaches Tried
- Small Domain method (n log n encoding)
- - Generates fantastically hard SAT problems!
- - Chokes many SAT solvers; Zchaff-II can handle it well
- Incremental SAT (see CAV04)
- QBF version: initial prototype needs lots of work
- - can serve to provide good QBF benchmarks
76 Approaches to transitivity blanket
Naïve: for all tuples i, j, and k, generate m_ij /\ m_jk => m_ik. Too many clauses (1B for a 1000-tuple program).
Better: obtain the transitive closure of known orderings and then prune irrelevant parts of the blanket.
E.g., if m_ij is known to be false, don't generate m_ij /\ ... => ... as well as ... /\ m_ij => ...
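The naïve encoding can be written down directly. The Python sketch below emits DIMACS-style clauses (lists of signed integers) for irreflexivity, totality, and the full n^3 transitivity blanket over a precedence matrix m_ij. The variable numbering and function names are illustrative choices, and none of the pruning described above is attempted.

```python
from itertools import product

def var(i, j, n):
    """Positive integer naming m_ij, i.e. 'tuple i is ordered before tuple j'."""
    return i * n + j + 1

def strict_total_order_clauses(n):
    """CNF clauses forcing m to be a strict total order on n tuples."""
    clauses = []
    for i in range(n):
        clauses.append([-var(i, i, n)])                    # irreflexivity: not m_ii
    for i in range(n):
        for j in range(i + 1, n):
            clauses.append([var(i, j, n), var(j, i, n)])   # totality: m_ij or m_ji
    for i, j, k in product(range(n), repeat=3):            # naive blanket, n^3 clauses
        # m_ij /\ m_jk => m_ik, written as a clause
        clauses.append([-var(i, j, n), -var(j, k, n), var(i, k, n)])
    return clauses

if __name__ == "__main__":
    for n in (3, 10, 100):
        print(n, len(strict_total_order_clauses(n)))
```

The blanket alone contributes n^3 clauses, which is where the figure of about one billion clauses for 1000 tuples comes from. Taking k = i in the transitivity clause together with irreflexivity also rules out m_ij and m_ji holding at once, so no separate asymmetry clauses are needed.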
77 Obtaining SAT-generator from HOL
Initial Spec:
atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\ (i.wrID = k.wrID)
    /\ order(i,j) /\ order(j,k)
    ==> (j.wrID = i.wrID)
Applying Contrapositive:
atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
    (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\ (i.wrID = k.wrID)
    /\ ~(j.wrID = i.wrID)
    ==> ~(order(i,j) /\ order(j,k))
After Reducing Quantifier Scopes:
atomicWBRelease(exec,order) =
  forall (i in exec). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    ==> forall (k in exec). (i.wrID = k.wrID)
      ==> forall (j in exec). ~(j.wrID = i.wrID)
        ==> ~(order(i,j) /\ order(j,k))
78 Obtaining SAT-generator from HOL
Transformed Spec:
atomicWBRelease(exec,order) =
  forall (i in exec). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    ==> forall (k in exec). (i.wrID = k.wrID)
      ==> forall (j in exec). ~(j.wrID = i.wrID)
        ==> ~(order(i,j) /\ order(j,k))
Functional Program that generates the constraints (will be automated):
atomicWBRelease(exec) = forall(i, exec, wb(i))
wb(i) = if ~((attr_of i.var = WB) /\ (i.op = StRel) /\ (i.wrType = Remote))
        then true
        else forall(k, exec, wb1(i,k))
wb1(i,k) = if ~(i.wrID = k.wrID)
           then true
           else forall(j, exec, wb2(i,k,j))
wb2(i,k,j) = if (j.wrID = i.wrID)
             then true
             else ~(order(i,j) /\ order(j,k))
forall(i, S, e(i)) = for all i in S. e(i) = foldr( map (fn i -> e(i)) (S), (/\), true )
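The nested if-then-else generator above maps naturally onto nested loops that skip the "then true" branches and emit a clause for each remaining case. The following Python sketch assumes the reading of the rule in which each leaf constraint is ~(order(i,j) /\ order(j,k)), that is, the binary clause (~m_ij \/ ~m_jk). The tuple fields follow the names on the slides, while `attr_of` and `lit` are hypothetical callbacks supplied by the caller.

```python
def atomic_wb_release_clauses(tuples, attr_of, lit):
    """Emit one clause [-m_ij, -m_jk] per (i, k, j) triple the rule constrains.

    tuples  : list of dicts with keys "op", "wrType", "var", "wrID"
    attr_of : maps a variable to its memory attribute, e.g. "WB" or "UC"
    lit     : maps a pair of tuples (p, q) to the SAT variable for order(p, q)
    """
    clauses = []
    for i in tuples:
        is_wb_release = (i["op"] == "StRel" and i["wrType"] == "Remote"
                         and attr_of(i["var"]) == "WB")
        if not is_wb_release:
            continue                      # wb(i): "then true" branch
        for k in tuples:
            if i["wrID"] != k["wrID"]:
                continue                  # wb1(i,k): "then true" branch
            for j in tuples:
                if j["wrID"] == i["wrID"]:
                    continue              # wb2(i,k,j): "then true" branch
                # j is an outsider: it may not sit between i and k
                clauses.append([-lit(i, j), -lit(j, k)])
    return clauses
```

Because the guards depend only on the given execution and not on the unknown order, they are evaluated while generating clauses, and only the order(...) atoms survive as SAT variables.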
79 Clause annotations for the unsat core for example
op1 11 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 11 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 11 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 11 op2 10 op3 -1 op4 -1 rule ReadValue
op1 -1 op2 -1 op3 -1 op4 -1 rule NoRule
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 12 op2 4 op3 -1 op4 -1 rule ReadValue
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 -1 op2 -1 op3 -1 op4 -1 rule NoRule
op1 10 op2 12 op3 -1 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 -1 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 10 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 9 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 -1 rule AtomicWBRelease
op1 1 op2 -1 op3 -1 op4 -1 rule Reflexive
op1 4 op2 5 op3 6 op4 -1 rule TransitiveOrder
op1 4 op2 5 op3 -1 op4 -1 rule ProgramOrder
op1 4 op2 6 op3 8 op4 -1 rule TransitiveOrder
op1 4 op2 11 op3 12 op4 -1 rule TransitiveOrder
op1 5 op2 6 op3 -1 op4 -1 rule ProgramOrder
op1 6 op2 8 op3 -1 op4 -1 rule TotalOrder
op1 10 op2 11 op3 -1 op4 -1 rule TotalOrder
op1 11 op2 4 op3 8 op4 -1 rule TransitiveOrder
op1 11 op2 4 op3 -1 op4 -1 rule TotalOrder
op1 11 op2 12 op3 -1 op4 -1 rule ProgramOrder
op1 -1 op2 -1 op3 -1 op4 -1 rule NoRule
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 6 op2 8 op3 -1 op4 -1 rule ReadValue
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 -1 op2 -1 op3 -1 op4 -1 rule NoRule
op1 11 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 11 op2 10 op3 -1 op4 -1 rule ReadValue
80 Building an Error-trail for UNSAT (infeasible executions)
The numbers denote op numbers. A store has both local and remote copies.
1 2 3 4:   st x 1
5:         mf
6:         ld r1 y <0>
7 8 9 10:  st.rel y 1
11:        ld.acq r2 y <1>
12:        ld r3 x <0>
81 Building an Error-trail
Ops: 1 2 3 4: st x 1;  5: mf;  6: ld r1 y <0>;  7 8 9 10: st.rel y 1;  11: ld.acq r2 y <1>;  12: ld r3 x <0>
op1 4 op2 5 op3 -1 op4 -1 rule ProgramOrder
82 Building an Error-trail
Ops: 1 2 3 4: st x 1;  5: mf;  6: ld r1 y <0>;  7 8 9 10: st.rel y 1;  11: ld.acq r2 y <1>;  12: ld r3 x <0>
op1 5 op2 6 op3 -1 op4 -1 rule ProgramOrder
83 Building an Error-trail
Ops: 1 2 3 4: st x 1;  5: mf;  6: ld r1 y <0>;  7 8 9 10: st.rel y 1;  11: ld.acq r2 y <1>;  12: ld r3 x <0>
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 6 op2 8 op3 -1 op4 -1 rule ReadValue
op1 6 op2 -1 op3 -1 op4 -1 rule ReadValue
84 Building an Error-trail
Ops: 1 2 3 4: st x 1;  5: mf;  6: ld r1 y <0>;  7 8 9 10: st.rel y 1;  11: ld.acq r2 y <1>;  12: ld r3 x <0>
op1 10 op2 12 op3 -1 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 -1 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 10 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 9 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 -1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 -1 rule AtomicWBRelease
85 Building an Error-trail
Ops: 1 2 3 4: st x 1;  5: mf;  6: ld r1 y <0>;  7 8 9 10: st.rel y 1;  11: ld.acq r2 y <1>;  12: ld r3 x <0>
op1 11 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 11 op2 10 op3 -1 op4 -1 rule ReadValue
op1 11 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 11 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 11 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 11 op2 10 op3 -1 op4 -1 rule ReadValue
86 Building an Error-trail
Ops: 1 2 3 4: st x 1;  5: mf;  6: ld r1 y <0>;  7 8 9 10: st.rel y 1;  11: ld.acq r2 y <1>;  12: ld r3 x <0>
op1 11 op2 12 op3 -1 op4 -1 rule ProgramOrder
87 Building an Error-trail
Ops: 1 2 3 4: st x 1;  5: mf;  6: ld r1 y <0>;  7 8 9 10: st.rel y 1;  11: ld.acq r2 y <1>;  12: ld r3 x <0>
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
op1 12 op2 4 op3 -1 op4 -1 rule ReadValue
op1 12 op2 -1 op3 -1 op4 -1 rule ReadValue
88 MPEC (MP Execution Checker) Tool Demo
- HOL Rules for Itanium, in a HOL Theory File
- Ganesh sitting down and coding
- An "MPECcable" OCaml Program
- Gentuple Assembler -> SAT Converter -> Zchaff-II or other -> SAT Result
- SAT (gives interleaving)
- UNSAT -> Zcore CORE Extractor -> Explain Error: Explainer and DOT file Generator -> GhostView -> Printout of Cycle Revealing Error
89 Other Tools Developed in UV Group
- Yue (Jason) Yang's Dissertation webpage
- Itanium Litmus-test Checker in Constraint Prolog
- NemosFinder: Easily Parameterizable Litmus-Checker Suite in Constraint Prolog
- UMM Tool: Easily Parameterizable Murphi Operational Model for writing Operational Specs of Memory Models
- DefectFinder: Demo Prototype of Memory-model Aware Race Analyzer
- Now at MSR
- (www.cs.utah.edu/yyang/), now jasony_at_microsoft.com
90 Part 3: What's not apparent at first glance
91 Topics
- Formal verification approaches to memory consistency compliance
- How to model the interface of the shared memory?
- Execution based
- IO mappings based
- What is wrong if an Execution based approach is chosen?
- Finite-state realizability
- A transducer-based model of shared memory
- - Highlights of results
- Whither undecidability?
92 Formal Verification Approaches
Agreement between Imp of Shared Memory Consistency Model (a protocol) and Spec of Shared Memory Consistency Model
- Several paper-and-pencil proofs
- Arons (PVS-based)
- McMillan (CTL model-checking based)
- Nalumasu et al. (Test Automata based)
- Qadeer (1. Finding a serializer. 2. Automated for simple write order)
- Bingham et al. (Window observer based)
93 Other Formal Approaches
- Park, Dill, Nowatzyk
- Pong and Dubois (several papers)
- Collier's work
- Ghughal's adaptation of above for weak memory models
- Chatterjee (CAV02)
- Yu, Tuttle, Lamport
- Shen, Arvind
- Ahamad, Neiger
- (Check webpage of MPV00 www.cs.utah.edu/mpv )
- Steinke and Nutt
- Gibbons, Gharachorloo
- Adve, Pugh
- (a survey will take too long)
94 Modeling the Interface of Shared Memory
- Trace Based
- - Most existing works
- IO Mappings Based
- The original Lazy-caching paper (casual use)
- Kawash and Higham (defines Specs this way; Implementations not addressed)
- Sezgin et al. (defines Specs, Imps, and Correspondence)
Trace-based interface (Spec and Imp): Read(proc, addr, data), Write(proc, addr, data), ...
IO-mappings-based interface (Spec and Imp): inputs Read_i(proc, addr), Write_i(proc, addr, data), ...; outputs Read_o(proc, addr, data), Write_o(proc, addr, data), ...
95 What's wrong with trace-based approaches?
- Permits making statements about uninteresting or unrealizable machines
- Muddies exact import of the famous undecidability result (Alur et al.)
96 Example 1: Finiteness cannot be adequately described through regular sets of executions alone
Consider the set of executions w(1,a,2) r(1,a,1)* r(2,a,2)* w(2,a,1), which defines the TEMPORAL order of events.
All these are considered SC because we can build a LOGICAL order w(1,a,2) r(2,a,2)* w(2,a,1) r(1,a,1)*.
But how can the above TEMPORAL order be generated by a FSM?
P1        P2
---       ---
w(a,2)    r(a,2)
r(a,1)    r(a,2)
r(a,1)    r(a,2)
r(a,1)    w(a,1)
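The claim that such a temporal order "is SC" can be checked mechanically for small instances by searching for a logical order. The following is a rough brute-force sketch, assuming an initial memory value of 0 and events written as `(kind, processor, address, value)` tuples; the name `find_logical_order` is hypothetical and the search is exponential in general, so it is only meant for toy traces.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, int, str, int]  # (kind, processor, address, value); kind is "w" or "r"


def find_logical_order(trace: List[Event], init: int = 0) -> Optional[List[Event]]:
    """Search for a serial order that keeps each processor's program order."""
    procs = sorted({e[1] for e in trace})
    # Per-processor program order is taken from the temporal order of the trace.
    program = {p: [e for e in trace if e[1] == p] for p in procs}

    def search(pos: Dict[int, int], mem: Dict[str, int],
               acc: List[Event]) -> Optional[List[Event]]:
        if all(pos[p] == len(program[p]) for p in procs):
            return list(acc)
        for p in procs:
            if pos[p] == len(program[p]):
                continue
            event = program[p][pos[p]]
            kind, _, addr, val = event
            if kind == "r" and mem.get(addr, init) != val:
                continue  # this read cannot be scheduled yet
            next_mem = {**mem, addr: val} if kind == "w" else mem
            found = search({**pos, p: pos[p] + 1}, next_mem, acc + [event])
            if found is not None:
                return found
        return None

    return search({p: 0 for p in procs}, {}, [])


# One unravelling of the starred execution, with each read repeated 2N times (N = 1).
N = 1
temporal = ([("w", 1, "a", 2)] + [("r", 1, "a", 1)] * (2 * N)
            + [("r", 2, "a", 2)] * (2 * N) + [("w", 2, "a", 1)])
logical = find_logical_order(temporal)
print(logical)
```

For this input the search finds the logical order in which P2's reads and write come before P1's reads, even though the temporal order shows P1 reading 1 before anyone has written it.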
97 Example 1 continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^{2N} r(2,a,2)^{2N} w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^{2N} w(2,a,1) r(1,a,1)^{2N}
A FSM Implementation Of Seq Consistency With N Internal States
- Program fed so far: w(1,a,2). Output generated so far: w(1,a,2)
- Program fed so far: w(1,a,2) r(1,a)^K, r(2,a)^L. Output generated so far: w(1,a,2)
- Program fed so far: w(1,a,2) r(1,a)^K, r(2,a)^L, NO w(2,a,1). Output generated so far: w(1,a,2) r(1,a,1)
FAIL! O/P w/o Input!!
98 Example 1 continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^{2N} r(2,a,2)^{2N} w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^{2N} w(2,a,1) r(1,a,1)^{2N}
A FSM Implementation Of Seq Consistency With N Internal States
- Program fed so far: wi(1,a,2). Output generated so far: wo(1,a,2)
- Program fed so far: wi(1,a,2) ri(1,a)^K, ri(2,a)^L. Output generated so far: wo(1,a,2)
- Program fed so far: wi(1,a,2) ri(1,a)^K, ri(2,a)^L wi(2,a,1). Output generated so far: wo(1,a,2)
FAIL! Too many inputs w/o output
99 Example 1 continued (take specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^{2N} r(2,a,2)^{2N} w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^{2N} w(2,a,1) r(1,a,1)^{2N}
A FSM Implementation Of Seq Consistency With N Internal States
- Program fed so far: wi(1,a,2) ri(1,a)^K, ri(2,a)^L wi(2,a,1). Output generated so far: wo(1,a,2)
FAIL! Too many inputs w/o output
State diagram: a transition on wi(1,a,2), followed by a loop labeled by ri(1,a)^K, ri(2,a)^L
We can pump this loop, thus making it possible to generate the SAME execution for arbitrarily long programs!!
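The pumping step rests on a pigeonhole argument: a machine with N internal states that consumes more than N inputs without emitting output must revisit a state, and the inputs between the two visits form a loop. The sketch below illustrates only that argument; `first_repeated_state` and the toy 4-state transition function are assumptions made for the example, not a model of any real protocol.

```python
from typing import Callable, Hashable, Optional, Sequence, Tuple


def first_repeated_state(step: Callable[[Hashable, str], Hashable],
                         start: Hashable,
                         inputs: Sequence[str]) -> Optional[Tuple[int, int]]:
    """Return (i, j) such that the state after i inputs equals the state after j inputs."""
    seen = {start: 0}   # state -> number of inputs consumed when it was first reached
    state = start
    for consumed, symbol in enumerate(inputs, start=1):
        state = step(state, symbol)
        if state in seen:
            return seen[state], consumed
        seen[state] = consumed
    return None


# Hypothetical machine with N = 4 states that silently buffers read requests.
N = 4
toy_step = lambda s, symbol: (s + 1) % N
reads = ["ri(1,a)"] * (N + 1)
loop = first_repeated_state(toy_step, 0, reads)
print(loop)
```

The inputs between positions `i` and `j` can be repeated any number of times without the machine noticing, which is what allows the same execution to be produced for arbitrarily long programs.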
100 Restrictions in contemporary work that enable SC verification
- Bingham, Condon, Hu
- - Require Prefix Closure (no outputs w/o input), e.g. the trace of length 1: r(1,a,1)
- - Rule out Prophetic Inheritance, i.e. Temporal Orders of the form w(1,a,2) r(1,a,1)^{2N} r(2,a,2)^{2N} w(2,a,1)
101 Restrictions in contemporary work that enable SC verification
- Qadeer
- Requires Simple Write Ordering
- The order of the writes to the same address in the temporal order and the logical order must be the same
- (But they provide an automated model-checking based verification method for this class of SC protocols)
Temporal Order: w(1,a,1) w(2,a,2) r(3,a,2) r(4,a,1)
Required Logical Order: w(2,a,2) r(3,a,2) w(1,a,1) r(4,a,1)
<diagram of Lazy Caching here>
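The simple write ordering condition can be stated as a small check on a pair of orders: project each order onto the writes to every address and compare the projections. This is a minimal sketch with hypothetical names (`write_order`, `has_simple_write_order`) and the same `(kind, processor, address, value)` event tuples assumed for illustration; it compares two given orders and does not search for a logical order.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, int, str, int]  # (kind, processor, address, value); kind is "w" or "r"


def write_order(order: List[Event]) -> Dict[str, List[Tuple[int, int]]]:
    """Per address, the sequence of (processor, value) writes in the given order."""
    per_addr: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for kind, proc, addr, val in order:
        if kind == "w":
            per_addr[addr].append((proc, val))
    return dict(per_addr)


def has_simple_write_order(temporal: List[Event], logical: List[Event]) -> bool:
    """True when writes to each address appear in the same order in both."""
    return write_order(temporal) == write_order(logical)


# The two orders shown on this slide.
temporal = [("w", 1, "a", 1), ("w", 2, "a", 2), ("r", 3, "a", 2), ("r", 4, "a", 1)]
logical = [("w", 2, "a", 2), ("r", 3, "a", 2), ("w", 1, "a", 1), ("r", 4, "a", 1)]
print(has_simple_write_order(temporal, logical))
```

For the pair of orders on this slide the check reports that the write orders differ, since the two writes to `a` are swapped between the temporal and the logical order.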
102 Taxonomy of formal SC modeling approaches
- Alur et al.
- Not Necessarily Prefix Closed (NNPC) regular traces model the SC language
- Checking containment of the (regular) language of the Implementation is undecidable
- Bingham, Condon, and Hu
- - DSC trace set (Decisive Sequential Consistency)
- Sezgin's work
- Models memory systems using regular transducers
- Defines EXACTLY what finite-state realizable SC systems are
- SC verification is language containment
- Provides a semi-decision procedure for SC verification in this setting
103 Example 2 (Sezgin): The dangers of trace-based modeling
- Imagine a memory system implementation that does this:
- Accept reads and writes
- If the first |P| x |A| instructions are writes, and further these contain exactly one write by each processor to each address
- THEN go into malevolent mode (disconnect the shared memory)
- ELSE go into benevolent mode (behave like serial memory)
Figure: processors P1, P2, ..., Pn. Benevolent Mode Connections lead to the Single Serial Memory Unit M; Malevolent Mode Connections lead to the separate memories M1, M2, ..., Mn.
104 Example 2 (Sezgin)
Example: P = {1,2,3}, A = {a}, and D = {0,1,2}
Benevolent Mode from now on, since the second instruction is a read:
w(1,a,2) r(3,a,2) w(2,a,1) r(1,a,1)
Malevolent Mode from now on, as we have |P| x |A| writes:
w(1,a,1) w(3,a,2) w(2,a,0) r(1,a,1) r(2,a,0) r(3,a,2) w(1,a,2) w(2,a,1) r(1,a,2) r(2,a,1) r(3,a,2)
LOGICAL ORDER
w(1,a,1) r(1,a,1) w(1,a,2) r(1,a,2) w(2,a,0) r(2,a,0) w(2,a,1) r(2,a,1) w(3,a,2) r(3,a,2) r(3,a,2)
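The malevolent-mode trace can be reproduced with a toy simulation in which every processor has its own private copy of memory, and the logical order is simply the per-processor subsequences laid end to end. This is only a sketch of the example on this slide: the function names are hypothetical, reads in the program carry no value (the machine supplies it), and an initial value of 0 is assumed.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, str, int]  # (kind, processor, address, value)


def run_malevolent(program: List[tuple]) -> List[Event]:
    """Serve each processor from its own private memory (shared memory disconnected)."""
    private: Dict[Tuple[int, str], int] = {}
    trace: List[Event] = []
    for instr in program:
        if instr[0] == "w":
            _, proc, addr, val = instr
            private[(proc, addr)] = val
            trace.append(("w", proc, addr, val))
        else:
            _, proc, addr = instr
            trace.append(("r", proc, addr, private.get((proc, addr), 0)))
    return trace


def concatenated_logical_order(trace: List[Event]) -> List[Event]:
    """All of P1's events, then all of P2's, and so on."""
    procs = sorted({e[1] for e in trace})
    return [e for p in procs for e in trace if e[1] == p]


def is_serial(order: List[Event], init: int = 0) -> bool:
    """True when every read returns the most recently written value."""
    mem: Dict[str, int] = {}
    for kind, _, addr, val in order:
        if kind == "w":
            mem[addr] = val
        elif mem.get(addr, init) != val:
            return False
    return True


program = [("w", 1, "a", 1), ("w", 3, "a", 2), ("w", 2, "a", 0),
           ("r", 1, "a"), ("r", 2, "a"), ("r", 3, "a"),
           ("w", 1, "a", 2), ("w", 2, "a", 1),
           ("r", 1, "a"), ("r", 2, "a"), ("r", 3, "a")]
temporal = run_malevolent(program)
logical = concatenated_logical_order(temporal)
print(is_serial(temporal), is_serial(logical))
```

The temporal order is not serial, yet the concatenated order is, because each processor's first access to `a` is its own write; a purely trace-based definition therefore accepts this disconnected machine as SC.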
105 Whoa? Any Logical Order will do?!
TEMPORAL ORDER
w(1,a,1) w(3,a,2) w(2,a,0) r(1,a,1) r(2,a,0) r(3,a,2) w(1,a,2) w(2,a,1) r(1,a,2) r(2,a,1) r(3,a,2)
LOGICAL ORDER
w(1,a,1) r(1,a,1) w(1,a,2) r(1,a,2) w(2,a,0) r(2,a,0) w(2,a,1) r(2,a,1) w(3,a,2) r(3,a,2) r(3,a,2)
- A Logical Order had better not be fiction: it should be a possible schedule in a "could have happened" sense
- Viewed from that angle, the above logical order is nonsense because it allows certain actions to be postponed unboundedly
- Sezgin's formal definition of Implementations builds in boundedness
- BCH address an instance of this in their past-time SC idea
- Sezgin's SC machines give logical order out as Commit Order
106 Status of SC undecidability
- Alur et al.: UNDECIDABLE under NNPC. NNPC is unrealistic.
- Qadeer: Decidable under simple write order. Simple Write Order rules out some protocols.
- Bingham, Condon, and Hu: Decidable under simple write order, also in DSC_k. These don't capture exactly those that are FS realizable.
- Sezgin's work: Decidability open. Captures exactly the class of FS realizable protocols in a detailed manner (Input or programs explicitly modeled).
107 Concluding Remarks
- Importance of topic unlikely to diminish
- Platform compliance is a big deal
- High-performance OS kernel writers need to know
- Think of proving a distributed Garbage Collector running on a Weak Memory Model (would be a great PhD topic)
- I've omitted too many important names I can't even remember
- Partial list: Adve, Gharachorloo, Pugh, Arvind, Collier,
108 Acknowledgements (sorry for omissions)
- Past students / postdoc: Nalumasu, Ghughal, Mokkedem, Hosabettu, Jones, Sivaraj, Yang, Yang, Kuramkote
- Faculty colleagues: Lindstrom, Slind, Carter
- Funding agencies: NSF, SRC
- Industrial Liaisons: Corella, Chou, German, Vaid, Neiger, Zeisset, Park
- Other favorable influences: Mathews, Tuttle, Yu, Joshi, Dill, Pong, Nowatzyk, Lamport, Hu, Condon, Higham, Kawash, Jackson
- Who am I forgetting?