Title: Shared Memory Consistency Models: A Broad Survey
Ganesh Gopalakrishnan*, School of Computing, University of Utah, Salt Lake City, UT
* Past work supported in part by SRC Contract 1031.001, NSF Award 0219805, and an equipment grant from Intel Corporation
2 Shared Memory Hardware Realities
3 Shared Memory Software Realities
- Must define the formal semantics of shared-memory concurrent programming while allowing for all reasonable optimizations
  - Defining the shared thread semantics for Java (Chapter 17 of the original Java book has essentially been ripped out)
  - Defining the shared memory model for new languages such as Unified Parallel C (UPC) for scientific programming
- At a deeper level: must have a formal basis for automatic, minimal fence insertion to make programs appear to execute sequentially consistently
4 Topics
- Motivations for strong and weak memory models
  - How it affects consistency protocol design
  - How it affects programming
- Classical memory models
  - Their power
- Fence insertion during compilation
  - Run on weak architectures but appear to run SC
- Overview of some weak architectures
- Itanium in a nutshell
- SAT-based programs that check executions against memory model specs
  - Demo of MP Execution Checker (MPEC) tool for Itanium
5 Topics
- Theoretical aspects of memory model specification
  - Specify using traces or specify using transducers
- Why trace-based specification can allow one to talk about unrealizable machines
  - Hence undecidability of sequential consistency is not a solved problem
- Why trace-based verification methods need to exert some care
  - Otherwise can prove conniving machines to be SC!!
- A brief taxonomy of recent results in this area
  - Mainly Alur et al., Qadeer, Bingham et al., and Sezgin
6 Sequential Consistency: The Most Basic Memory Consistency Model
- There exists a common total order
- It respects program order
- Each read sees the latest write
Example
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1           y = 2
r1 = y          r2 = x
Under Sequential Consistency: No. Under many weak models: Yes.
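The "No" under SC can be checked mechanically. The following is a minimal sketch, assuming the example is the two-thread program reconstructed above (Thread 1: x = 1; r1 = y. Thread 2: y = 2; r2 = x). It enumerates every interleaving that respects program order and collects the resulting (r1, r2) pairs. The names `interleavings`, `run`, and `sc_outcomes` are illustrative, not from the slides.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's internal order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

# ("w", var, value) stores to memory; ("r", var, reg) loads memory into a register.
THREAD1 = [("w", "x", 1), ("r", "y", "r1")]
THREAD2 = [("w", "y", 2), ("r", "x", "r2")]

def run(schedule):
    """Execute one interleaving atomically, step by step, against a single memory."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for kind, var, arg in schedule:
        if kind == "w":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return regs["r1"], regs["r2"]

def sc_outcomes(t1, t2):
    """All (r1, r2) results reachable under sequential consistency."""
    return {run(s) for s in interleavings(t1, t2)}
```

Calling `sc_outcomes(THREAD1, THREAD2)` yields three outcomes, none of which is (0, 0): whichever store comes first in the total order is visible to the other thread's later load.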
7 How to Think About Sequential Consistency
[Figure: processors P1, P2, ..., Pn connected to a single memory]
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1           y = 2
r1 = y          r2 = x
No! Not under SC! But possible under many weak memory models! An example of such a weak memory model is SPARC TSO.
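To see why a TSO-like machine admits r1 = r2 = 0, here is a rough operational sketch. It assumes a simplified model that is not taken from the SPARC specification: each thread has a FIFO store buffer, loads first look in the thread's own buffer and otherwise read memory, and buffered stores drain to memory at arbitrary times. All names are hypothetical.

```python
# ("w", var, value) is a store; ("r", var, reg) is a load into a register.
THREAD1 = [("w", "x", 1), ("r", "y", "r1")]
THREAD2 = [("w", "y", 2), ("r", "x", "r2")]

def tso_outcomes(threads, init):
    """Explore all behaviours of the simplified store-buffer machine."""
    n = len(threads)
    results, seen = set(), set()

    def explore(pcs, bufs, mem, regs):
        state = (pcs, bufs, mem, regs)
        if state in seen:
            return
        seen.add(state)
        done = all(pcs[i] == len(threads[i]) for i in range(n))
        if done and not any(bufs):
            results.add(regs)
            return
        for i in range(n):
            if bufs[i]:  # drain thread i's oldest buffered store to memory
                (var, val), rest = bufs[i][0], bufs[i][1:]
                new_mem = tuple(sorted({**dict(mem), var: val}.items()))
                explore(pcs, bufs[:i] + (rest,) + bufs[i + 1:], new_mem, regs)
            if pcs[i] < len(threads[i]):  # execute thread i's next instruction
                kind, var, arg = threads[i][pcs[i]]
                new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
                if kind == "w":  # stores go into the local buffer, not to memory
                    new_bufs = bufs[:i] + (bufs[i] + ((var, arg),),) + bufs[i + 1:]
                    explore(new_pcs, new_bufs, mem, regs)
                else:  # loads forward from the own buffer, else read memory
                    pending = [v for (x, v) in bufs[i] if x == var]
                    val = pending[-1] if pending else dict(mem)[var]
                    new_regs = tuple(sorted({**dict(regs), arg: val}.items()))
                    explore(new_pcs, bufs, mem, new_regs)

    explore((0,) * n, ((),) * n, tuple(sorted(init.items())), ())
    # Registers are sorted by name, so each result tuple is (r1, r2).
    return {tuple(v for _, v in r) for r in results}
```

With `tso_outcomes([THREAD1, THREAD2], {"x": 0, "y": 0})`, the outcome (0, 0) appears: both stores can sit in their buffers while both loads read the initial values from memory.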
8 Coherence: Per-location Sequential Consistency
[Figure: processors P1, P2, ..., Pn connected to a 1-address memory]
Initially, x = y = 0. Finally, can r1 = r2 = 0?
Thread 1        Thread 2
x = 1           y = 2
r1 = y          r2 = x
Notice that the same execution is Coherent!
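The difference between SC and coherence can be illustrated with a brute-force check. This is a sketch under the assumption that an execution is given as per-thread lists of ("w" or "r", variable, value) events with initial values 0. Coherence is tested by projecting the execution onto each variable and asking for an SC explanation of each projection separately. The function names are illustrative.

```python
def merges(seqs):
    """Yield every interleaving of the sequences that preserves each one's order."""
    if all(not s for s in seqs):
        yield []
        return
    for i, s in enumerate(seqs):
        if s:
            rest = seqs[:i] + [s[1:]] + seqs[i + 1:]
            for tail in merges(rest):
                yield [s[0]] + tail

def legal(schedule, init=0):
    """True if every read returns the most recent write to its variable."""
    last = {}
    for kind, var, val in schedule:
        if kind == "w":
            last[var] = val
        elif last.get(var, init) != val:
            return False
    return True

def is_sc(threads):
    return any(legal(s) for s in merges(threads))

def is_coherent(threads):
    """SC checked one location at a time (per-location SC)."""
    variables = {var for t in threads for _, var, _ in t}
    return all(
        is_sc([[e for e in t if e[1] == v] for t in threads])
        for v in variables
    )

# The execution with r1 = r2 = 0 from the example above.
EXECUTION = [
    [("w", "x", 1), ("r", "y", 0)],
    [("w", "y", 2), ("r", "x", 0)],
]
```

Here `is_sc(EXECUTION)` is false while `is_coherent(EXECUTION)` is true: each location taken alone has a legal serialization, but no single order explains both locations at once.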
9 Memory Consistency Models
Defines the legal orderings of memory operations that can be perceived at the user level
- Processors intermittently throw colors onto memory cells and also intermittently look at their colors
[Figure: processors P1, P2, ..., Pi, ..., Pn writing to and reading from Memory Cell 1, Memory Cell 2, ..., Memory Cell n]
10 Memory Consistency Models
Defines the legal orderings of memory operations that can be perceived at the user level
- Many have been developed
  - Sequential Consistency (SC)
  - Coherence (per-location SC)
  - Pipelined RAM (PRAM)
  - Causal Consistency
  - Processor Consistency (PC)
  - Release Consistency
  - Location Consistency
  - The Intel Itanium Memory Model
  - Java Memory Model (JMM)
  - and more!
11 Memory Consistency Model Specifications
A VERY complex specification for a real architecture (e.g., Itanium, PowerPC, ...). Also of growing concern in software (e.g., Java Memory Model, Unified Parallel C model, ...)
12 Motivation for (weak) Memory Consistency Models: A Hardware Perspective
- Cannot afford to do industrious updates across large MP systems
- Delayed and reorderable updates allow considerable latitude in memory consistency protocol design, hence fewer bugs in protocols!!
[Figure: hierarchy of chip-level, intra-cluster, and inter-cluster protocols, with directories (dir) and memories (mem)]
13 Price Paid for Delayed Updates: Bugs!
- Algorithms such as Peterson's Mutual Exclusion cease to work!
Thread 1                        Thread 2
Flags1 = BUSY                   Flags2 = BUSY
Turn = 2                        Turn = 1
while (Flags2 == BUSY &&        while (Flags1 == BUSY &&
       Turn != 1) ;                    Turn != 2) ;
critical section                critical section
Flags1 = FREE                   Flags2 = FREE
In each thread, the loads in the while condition CAN READ AN OLD VALUE!!
14 Scope of Tutorial
- Survey of Classical Work
- Survey of Current Activities (that this speaker is aware of)
- Verification Challenges
- Theoretical Questions
- Justification for topic selection
  - Complement talks on Shared Memory Consistency Protocols
  - Intuitions more important than the detailzzz.
  - Knowing who's who in this area helps
  - Excuse for me to stick my neck out and learn something new
15 Organization
- Overview (mainly of classical works)
- Practical aspects of weak consistency models (more depth)
- What's not apparent at first glance (still more depth)
- Conclusions and references
16 Part 1: Overview of Classical Work
17 Memory Serves to Plumb Data
Uniprocessor: Write(address 2, data 33) ... Read(address 2, returns data 33)
Multiprocessor:
P1                P2
Write(2, 33)      Read(2, 33)
... but respecting Coherence! The following two multiprocessor executions are questionable (marked "?" on the slide):
P1                P2
Write(2, 33)      Write(2, 77)
Read(2, 77)       Read(2, 33)

P1                P2                P3                P4
Write(2, 33)      Write(2, 77)      Read(2, 33)       Read(2, 77)
                                    Read(2, 77)       Read(2, 33)
18 ... but Coherence is not sufficient
From Shasha and Snir, Figure 1, p. 282 (ACM TOPLAS 10(2), 1988)
Processor 1                Processor 2
Test_and_set1(LOCK)        Test_and_set2(LOCK)
Read1(X)                   Read2(X)
Write1(X)                  Write2(X)
Reset1(LOCK)               Reset2(LOCK)
The following memory access sequence respects Coherence but breaks the critical section:
Test_and_set1(LOCK), Read1(X), Reset1(LOCK), Test_and_set2(LOCK), Read2(X), Write1(X), Write2(X), Reset2(LOCK)
- A consistent view ACROSS THE ADDRESS SPACE is needed
- The most intuitive such view: Sequential Consistency!
19 Basic understanding of SC
- Execute AS IF the instructions in each thread were executed sequentially and atomically
  - respecting the program order in each thread
  - no constraints across sequential programs
Requires effort to achieve the above effect AS WELL AS high performance
Write(4, 66) MISSES; Read(2, 22) HITS
Write(2, 55) MISSES; Read(4, 11) HITS
Which Read waits?
[Figure: CPU 1 ... CPU n connected to a Memory and Bus Controller]
20 Aggressive SC Implementations
From Adve, Pai, and Ranganathan (Proc. IEEE 87(3), March 1999, p. 448): If the accessed location does not change its value until the Read could have been non-speculatively issued, then the speculation is successful. Otherwise, roll back the speculation up to the incorrect load. (Similar schemes are used in HP PA-8000, Intel Pentium Pro, MIPS R10K.)
Write(4, 66) MISSES; Read(2, 22) HITS
Write(2, 55) MISSES; Read(4, 11) HITS
Snoops are: Write(4,66), Write(2,55)
[Figure: CPU 1 ... CPU n connected to a Memory and Bus Controller; both CPUs observe the same snoops]
One way to implement this: if the bus snoop for Write(4,..) arrives before that for Write(2,..), the Read(4, 11) is invalidated and it reissues
21 Unexpected Interactions: SC and Write-Update Protocols (from Grahn, Stenstrom, Dubois)
- An important aspect of Sequential Consistency is Write Atomicity
- Write-Invalidate protocols can easily guarantee Write Atomicity
- However, Write-Update protocols are often recommended (read latency)
- Ensuring Write Atomicity in Write-Update protocols is tricky
- WEAK MEMORY MODELS TO THE RESCUE!
  - Don't care about Write Atomicity except at Acquire / Release points
[Figure: hierarchy of chip-level, intra-cluster, and inter-cluster protocols, with directories (dir) and memories (mem)]
22 A Deeper Look at Coherence
Checking the coherence of executions is NP-complete
Cantin's proof: reduction from SAT
The existence of a coherent schedule is tested
Example Consider (u1 \/ u2) /\ (u1 \/
u2) Create the following concurrent
processes h1 h2 h_u1
h_u1 h_u2 h_u2 h3 
  
   W(d_u1)
W(d_u1) R(d_u1) R(d_u1) R(d_u2)
R(d_u2) R(d_c1) W(d_u2)
W(d_u2) R(d_u1) R(d_u1) R(d_u2)
R(d_u2) R(d_c2)
W(d_c1) W(d_c2)
W(d_c1) W(d_u1)
W(d_c2) W(d_u2)
W(d_u1)
W(d_u2)
W(d_F)
Literal Gadget
Clause Gadget
23 A Deeper Look at Coherence
- Memory models that relax coherence, and how useful they are
  - PRAM (Pipelined RAM; Lipton and Sandberg) is of academic interest
[Figure: processors P1, P2, ..., Pn, with one memory per processor]
One memory per processor. Program order is obeyed, but there is no Write Atomicity
24 A Deeper Look at Coherence
- Memory models that relax coherence, and how useful they are
  - PRAM: of academic interest
  - Location Consistency (LC)
    - Proposed by Gao and Sarkar
    - They tout its advantages in terms of scalability
    - They describe an LC protocol machine
    - Analysis by Wallace et al. (PDPTA 2002, pp. 1542-1550)
      - Shown that this LC machine is stronger than the LC definition
      - Question whether LC programs indeed appear to execute with sequentially consistent outcomes, assuming that they are properly labeled
    - I have not seen many pubs on LC of late
25 Classical Weak Memory Models
- Processor Consistency is widely known
- Good discussions in Ahamad et al., "The Power of Processor Consistency"
- First understand PRAM:
  - For each processor $p$, there is a legal serialization $S_p$ of $H_{p+w}$ such that if $o_1$ and $o_2$ are in $H_{p+w}$ and $o_1 \xrightarrow{po} o_2$, then $o_1 \xrightarrow{S_p} o_2$
- For PC_g, we add the following condition:
  - for any two processors $p$ and $q$, and for any location $x$, $S_p|(w,x) = S_q|(w,x)$
- Processor Consistency according to Goodman (PC_g) is not the same as PC_d, processor consistency according to the DASH project
26 Execution that's PRAM and Coherent but not PC_g
P: w(x,0) w(y,0)
Q: r(y,0) w(x,1)
R: r(x,1) r(x,0)
Coherent! Just look at each color separately.
Not PC_g: construct a history per processor with all of that processor's actions and all of the others' writes in that history. PC_g requires the write histories to agree per variable, but in our example:
History of Q: w(x,0) w(x,1)
History of R: w(x,1) w(x,0)
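The per-variable agreement test can be sketched in a few lines. This is an illustrative check only: it compares one given choice of per-processor serializations, whereas PC_g asks whether some agreeing choice exists. The full histories for Q and R below are an assumption, filled in from the three-processor execution so that their x-write projections match the two shown above.

```python
def write_projection(history, var):
    """The writes to var, in the order they appear in one processor's history."""
    return tuple(op for op in history if op[0] == "w" and op[1] == var)

def disagreeing_variables(histories):
    """Variables whose write order differs between at least two histories."""
    variables = {op[1] for h in histories.values() for op in h if op[0] == "w"}
    return sorted(
        var for var in variables
        if len({write_projection(h, var) for h in histories.values()}) > 1
    )

# Assumed serializations: each holds the processor's own operations plus all writes.
HISTORIES = {
    "Q": [("w", "x", 0), ("w", "y", 0), ("r", "y", 0), ("w", "x", 1)],
    "R": [("w", "x", 1), ("r", "x", 1), ("w", "x", 0), ("r", "x", 0), ("w", "y", 0)],
}
```

`disagreeing_variables(HISTORIES)` reports `x`: Q sees w(x,0) before w(x,1) while R sees them in the opposite order, which is exactly the disagreement PC_g forbids.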
27 The Power of Processor Consistency
- Can handle Peterson (Ahamad)
- Can't handle Bakery (Ahamad)
- What else? (Kawash and Higham, "Bounds for mutual exclusion with only Processor Consistency")
  - Peterson is correct for PC_g (a multi-writer protocol)
  - Bakery is incorrect for PC_g (a single-writer protocol)
  - Kawash and Higham prove that for mutual exclusion under PC_g, one multi-writer and n single-writers are necessary
28 Observations
- Weak shared memory consistency models allow consistency protocols to be efficient
- Unfortunately, programmers find weak models non-intuitive
- How can we have the best of both worlds?
  - weak models to be supported by the hardware
  - strong models to be presented by the software
- This can be achieved through compilers that insert the minimal number of fence instructions to give the appearance of SC
29 Basics of Fence Insertion
- Widely cited work is by Shasha and Snir
- Recent work by Lee, Midkiff, and Padua extends the above
- Let us go through some examples (initially all mem. locations are 0)
P1                P2
write(x,1)        read(y, yd)
write(y,1)        read(x, xd)
Under SC, if yd = 1, then xd = 1
30 Basics of Fence Insertion
P1                P2
write(x,1)        read(y, yd)
write(y,1)        read(x, xd)
- BUT if we allow instructions to reorder, then the guarantee "if yd = 1, then xd = 1" is lost!!
- But often we CAN reorder without noticing an SC violation
- When can we reorder??
31 Basics of Fence Insertion
- Widely cited work is by Shasha and Snir (our examples are from their paper)
- Recent work by Lee, Midkiff, and Padua extends the above
- Let us go through some examples (initially all mem. locations are 0)
P1                P2
write(x,1)        read(y, yd)
write(y,1)        read(x, xd)
[Figure: a and b label the two program-order edges, one per processor]
- Which program order edges in P = {a, b} must be respected in order to guarantee SC-compliant executions?
  - Preserving a alone: insufficient, as it can return xd = 0, yd = 1
  - Preserving b alone: insufficient, as it can return xd = 0, yd = 1
  - BOTH a and b need to be preserved. How to compute this in general?
- Terminology: {a, b} in this example forms the Delay Set, D
32 Analysis is based on Critical Cycles
- Locate all critical cycles in the concurrent program
- Equate the Delay Set D to all the program-order edges in all critical cycles
- Locating critical cycles:
  - Locate all Conflict Edges C: locate two accesses that are concurrent and one of them is a write; these give the undirected Conflict Edges C
  - A critical cycle is a cycle in P U C that has the following properties:
    - Contains at most two operations from the same thread, and these are consecutive in it
    - Contains 0, 2, or 3 accesses to each shared variable, and these are consecutive in it (further properties omitted)
33 Finding Critical Cycles: Example 1
P1                P2
write(x,1)        read(y, yd)
write(y,1)        read(x, xd)
[Figure: the program above drawn twice, first with Program Order Edges P and Conflict Edges C, then with the Critical Cycle highlighted]
Delay Set D = all the P edges in the Critical Cycle = P in our case
34 Finding Critical Cycles: Example 2
P1                P2
read(x, xd)       write(x,1)
read(y, yd)       write(y,1)
Basically a while loop
[Figure: the program above drawn twice, first with Conflict Edges, then with the Critical Cycle highlighted; a, b, c label the program-order edges]
Delay Set D = {b, c}, whereas P = {a, b, c}
35 Finding Critical Cycles: Example 3
a1: read A     b1: read B     c1: read C     d1: read D
a2: write B    b2: write C    c2: write D    d2: write A
D = {(a1,b1), (a1,c1), (a1,d1), (a2,d2), (b2,d2), (c2,d2)} suffices to ensure SC!
I.e., a1 is an acquire-read and d2 is a release-write!!
36 Basic Approach to Fence Insertion
- Goal: discover the minimal set of fences to be inserted into a concurrent shared memory program
- Suppose D is the delay set discovered by the previous analysis
- Suppose the underlying (weak) architecture supports orderings D_o
- Let D_m be the fences to be inserted to get the effect of D:
  $D_m = (D \cup D_o)^{tr} \setminus D_o$, where $tr$ is the transitive reduction
Example (operations a, b, c, d in program order):
- Required Delay Set D = {(a,b), (b,c), (a,d)}
- D_o = {(c,d)}
- $(D \cup D_o)^{tr}$ = {(a,b), (b,c), (c,d)}
- $(D \cup D_o)^{tr} \setminus D_o$ = {(a,b), (b,c)}: fences needed only here
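The computation of D_m on this example can be reproduced with a small sketch. It assumes orderings are given as sets of (before, after) pairs forming an acyclic relation. The helper names `transitive_closure`, `transitive_reduction`, and `fences_needed` are illustrative, not from the slides.

```python
def transitive_closure(edges):
    """Repeatedly add (a, d) whenever (a, b) and (b, d) are present."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def transitive_reduction(edges):
    """Smallest edge set with the same reachability (assumes the relation is acyclic)."""
    closure = transitive_closure(edges)
    nodes = {n for e in closure for n in e}
    return {
        (a, b) for (a, b) in closure
        if not any((a, m) in closure and (m, b) in closure for m in nodes)
    }

def fences_needed(delay_set, hw_order):
    """D_m: reduced orderings that the hardware does not already provide."""
    return transitive_reduction(delay_set | hw_order) - hw_order

D = {("a", "b"), ("b", "c"), ("a", "d")}
D_O = {("c", "d")}
```

`fences_needed(D, D_O)` returns {(a,b), (b,c)}: the required edge (a,d) is implied by a -> b -> c -> d, and (c,d) comes for free from the hardware.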
37 Basic Approach to Fence Insertion
- Required Delay Set D = {(a,b), (b,c), (a,d)}
- D_o = {(c,d)}
- $(D \cup D_o)^{tr}$ = {(a,b), (b,c), (c,d)}
- $(D \cup D_o)^{tr} \setminus D_o$ = {(a,b), (b,c)}: fences needed only here
So, in a nutshell: the sequence a; fence; b; fence; c; d implements the desired delay set, where the ordering of c before d is the hardware-provided ordering.
38 Deriving Fences from Correctness Proofs
Lamport's paper "How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor", IEEE Trans. Computers 46(7), 1997, provides really good insight into deriving required weak orderings thru proofs
- Notations
  - A -> B: every event in A precedes every event in B
  - A --> B: some event in A precedes some event in B
[Figure: two inference rules over these relations, each labeled "Implies"]
39 Deriving Places to insert a Synch Instruction
There is a proof in Lamport's paper that with just these Synch instructions, mutual exclusion is guaranteed.
Repeat forever
  noncritical section
  L: x_i := true
  For j := 1 until i-1 Do
    if x_j then x_i := false
                while x_j do od
                goto L
    fi
  oD
  For j := i+1 until N do
    while x_j do od
  od
  critical section
  x_i := false
End Repeat
[Figure: three Synch instructions are marked at specific points in the code above]
40 Part 2: A Detailed Look at a Practical Weak Memory Model: Itanium (I do mention three others briefly)
41 Well, let's look at the big picture first
- Sparc TSO, PSO, RMO
  - Reads and Writes follow the TSO, PSO, or RMO semantics
  - Additional Fence instructions
  - and others (e.g., semaphores)
  - I'm not up to speed on these
- Alpha
  - Reads (only coherence)
  - Writes (only coherence)
  - Load-Locked
  - Store-Conditional
  - Membar
42 Well, let's look at the big picture
- Power4
  - Reads and Writes (don't know much)
  - Sync (Synchronize)
  - Lwsync (Lightweight Sync; new in Power4)
  - EIEIO (Enforce In-Order Execution of I/O)
  - Lwarx (Load word and reserve)
  - Ldarx (Load doubleword and reserve)
  - Stwcx (Store word conditional)
  - Stdcx (Store doubleword conditional)
  - Isync (Instruction synchronize)
Perhaps Old McDonald knows more
43 IA-32, IA-64, AMD, ...?
- Generally thought to be Processor Consistency
- Does it really help to formally specify (or even reveal the details)?
  - Intel thought so
- The Itanium memory model is described next
44 The Intel Itanium Processor memory model
- Has these kinds of instructions:
  - weak load or ordinary load: ld
  - strong load or acquire load: ld.acq
  - weak store or ordinary store: st
  - strong store or release store: st.rel
  - memory fence (NOT barrier!): mf
  - A few semaphore types
  - Allows sub-word writes, I/O spaces (we don't model these)
45 Itanium memory model thru examples
Ordinary store: st x = 2
- Can freely slide in a sequential program
- Only rule is coherence
The same applies to an ordinary load: ld reg1 = x
46 Itanium memory model thru examples
Release store: st.rel x = 2
- Things before it in sequential program order cannot happen after it
- Things after it in sequential program order may happen before it!!
47 Itanium memory model thru examples
Acquire load: ld.acq r3 = y
- Things before it in sequential program order may happen after it
- Things after it in sequential program order cannot happen before it!!
48 But with these rules alone, we cannot explain the following legal outcome in Itanium
st.rel y = 1              st.rel x = 2
ld.acq r3 = y <1>         ld.acq r4 = x <2>
ld reg1 = x <0>           ld reg2 = y <0>
[Figure annotations: "Data dep." between each store and the ld.acq of the same variable; "ld.acq rule" between each ld.acq and the following ld]
The Itanium specification DOES NOT try to explain outcomes in terms of shuffles of the original instructions!
49 Itanium rules explain execution outcomes in terms of progenies of stores and loads
This has turned out to be an unspoken convention in this area for other memory models also
- A store generates (n+1) progenies
- Other instructions generate only one
Example: st y = 1 issued by P0 yields a local copy for P0, a remote copy for P0, and a remote copy for P1; ld.acq r3 = y yields a single tuple
50 We wrote such a breeding assembler
P1: St a,1; Ld r1,a <1>; St b,r1 <1>
P2: Ld.acq r2,b <1>; Ld r3,a <0>
Tuple 1
id=0 proc=0 pc=0 op=St    var=0 data=1 wrID=0  wrType=Local    wrProc=0  reg=-1 useReg=false
id=1 proc=0 pc=0 op=St    var=0 data=1 wrID=0  wrType=Remote   wrProc=0  reg=-1 useReg=false
id=2 proc=0 pc=0 op=St    var=0 data=1 wrID=0  wrType=Remote   wrProc=1  reg=-1 useReg=false
id=3 proc=0 pc=1 op=Ld    var=0 data=1 wrID=-1 wrType=DontCare wrProc=-1 reg=0  useReg=true
id=4 proc=0 pc=2 op=St    var=1 data=1 wrID=4  wrType=Local    wrProc=0  reg=0  useReg=true
id=5 proc=0 pc=2 op=St    var=1 data=1 wrID=4  wrType=Remote   wrProc=0  reg=0  useReg=true
id=6 proc=0 pc=2 op=St    var=1 data=1 wrID=4  wrType=Remote   wrProc=1  reg=0  useReg=true
id=7 proc=1 pc=0 op=LdAcq var=1 data=1 wrID=-1 wrType=DontCare wrProc=-1 reg=1  useReg=true
id=8 proc=1 pc=1 op=Ld    var=0 data=0 wrID=-1 wrType=DontCare wrProc=-1 reg=2  useReg=true
...
Tuple 9
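The "breeding" step can be approximated as follows. This is a simplified sketch, not the authors' assembler: it keeps only a few of the tuple fields, and the class and function names (`ExecTuple`, `breed`) are made up for illustration. Each store yields one local copy plus one remote copy per processor, and every other instruction yields a single tuple.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecTuple:
    id: int
    proc: int
    pc: int
    op: str
    wr_type: str  # "Local", "Remote", or "DontCare"
    wr_proc: int  # processor receiving this store copy, -1 for non-stores

def breed(program, num_procs):
    """program: list of (proc, pc, op) in issue order; returns the split tuples."""
    tuples = []
    for proc, pc, op in program:
        if op.startswith("St"):
            # one local copy, then one remote copy for each of the n processors
            copies = [("Local", proc)] + [("Remote", p) for p in range(num_procs)]
        else:
            copies = [("DontCare", -1)]
        for wr_type, wr_proc in copies:
            tuples.append(ExecTuple(len(tuples), proc, pc, op, wr_type, wr_proc))
    return tuples

# The two-processor program from the slide, with processors numbered from 0.
PROGRAM = [(0, 0, "St"), (0, 1, "Ld"), (0, 2, "St"), (1, 0, "LdAcq"), (1, 1, "Ld")]
```

For this program, `breed(PROGRAM, 2)` produces nine tuples with ids 0 through 8: three per store and one per load.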
51 Itanium rules specify how to line up the tuples to explain the load outcomes!!
P0                        P1
st y = 1                  st x = 2
ld.acq r3 = y <1>         ld.acq r4 = x <2>
ld reg1 = x <0>           ld reg2 = y <0>
Split copies (l = local copy, rp0 / rp1 = remote copy for P0 / P1):
st y = 1 (l), st y = 1 (rp0), st y = 1 (rp1)
st x = 2 (l), st x = 2 (rp0), st x = 2 (rp1)
Now, arrange the split copies.
Explanation [Figure: the copies arranged as below, connected by "Dependencies" and "Anti dependencies" edges]:
st y = 1 (l)
ld.acq r3 = y <1>
st x = 2 (l)
ld.acq r4 = x <2>
st y = 1 (rp0)
st x = 2 (rp1)
ld reg1 = x <0>
st x = 2 (rp0)
ld reg2 = y <0>
st y = 1 (rp1)
52 Gist of our method: Illustration on SC and on Itanium
SC(exec) = Exists order. (
     requireStrictTotalOrder exec order
  /\ requireProgramOrder exec order
  /\ requireReadValue exec order )
The tuples to be ordered: find an arrangement under SC constraints
legalItanium(exec) = Exists order. (
     requireStrictTotalOrder exec order
  /\ requireWriteOperationOrder exec order
  /\ requireItProgramOrder exec order
  /\ requireMemoryDataDependence exec order
  /\ requireDataFlowDependence exec order
  /\ requireCoherence exec order
  /\ requireAtomicWBRelease exec order
  /\ requireSequentialUC exec order
  /\ requireNoUCBypass exec order
  /\ requireReadValue exec order )
The tuples to be ordered: find an arrangement as per the above constraints
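The "exists order" shape of SC(exec) can be mimicked directly by brute force. This is only a sketch for tiny executions, assuming events are (id, proc, pc, kind, var, value) tuples and initial memory is 0. Python permutations stand in for the strict total order, and the two predicates play the roles of requireProgramOrder and requireReadValue; the function names are illustrative.

```python
from itertools import permutations

def require_program_order(events, order):
    """Events of the same processor appear in increasing pc order."""
    pos = {e: i for i, e in enumerate(order)}
    return all(
        pos[a] < pos[b]
        for a in events for b in events
        if a[1] == b[1] and a[2] < b[2]
    )

def require_read_value(order, init=0):
    """Each read returns the value of the most recent earlier write to its variable."""
    last = {}
    for (_id, _proc, _pc, kind, var, val) in order:
        if kind == "w":
            last[var] = val
        elif last.get(var, init) != val:
            return False
    return True

def sc(events):
    """Exists order. strict total order /\\ program order /\\ read value."""
    return any(
        require_program_order(events, order) and require_read_value(order)
        for order in permutations(events)  # every permutation is a strict total order
    )

# Two-thread example: x = 1; r1 = y  ||  y = 2; r2 = x, with observed r1 = r2 = 0.
BAD = [(0, 0, 0, "w", "x", 1), (1, 0, 1, "r", "y", 0),
       (2, 1, 0, "w", "y", 2), (3, 1, 1, "r", "x", 0)]
# Same program with observed r1 = 0, r2 = 1.
GOOD = [(0, 0, 0, "w", "x", 1), (1, 0, 1, "r", "y", 0),
        (2, 1, 0, "w", "y", 2), (3, 1, 1, "r", "x", 1)]
```

`sc(BAD)` is false and `sc(GOOD)` is true. Enumerating permutations is factorial in the number of events, which is one reason to hand the search to a SAT solver instead.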
53 Our Itanium Formal Model (extracted from Intel documents, written as a HOL theory)
legal_itanium exec =      (* exec: a given execution *)
  ?order. requireStrictTotalOrder exec order
       /\ requireWriteOperationOrder exec order
       /\ requireProgramOrder exec order
       /\ requireMemoryDataDependence exec order
       /\ requireDataFlowDependence exec order
       /\ requireCoherence exec order
       /\ requireReadValue exec order
       /\ requireAtomicWBRelease exec order
       /\ requireSequentialUC exec order
       /\ requireNoUCBypass exec order
See CHARME'03, IPDPS'04, CAV'04. Various contributions by Yue Yang, Gopalakrishnan, Lindstrom, Slind, Sivaraj, Yu Yang
54 requireStrictTotalOrder exec order
55 requireWriteOperationOrder exec order
- Local Write before Local Global Write
- Local Write before Remote Global Writes
56 requireProgramOrder exec order
Program order is defined solely through Acquires, Releases, and Fences
57 requireMemoryDataDependence exec order
Order two accesses (Read or Write) under these conditions:
IF program-ordered AND the same variable AND
   Write is local and RAW (and Read of course is local)
   OR Write is local and WAR
   OR both writes are local and WAW
   OR both writes are remote and WAW and fall in the same processor
58 requireDataFlowDependence exec order
Data dependence thru the register space
59 requireCoherence exec order
Just plain-old coherence, but for TWO WRITES falling in the WB or UC space, and for EITHER two local writes OR two remote writes in the same processor
60 requireReadValue exec order
Reads return the most recent writes
61 requireAtomicWBRelease exec order
All remote events stemming from the same release-write instruction appear to be an atomic set
62 requireSequentialUC exec order
In the UC space, program-ordered UC Read and Write events, both of which are local, are ordered as per program order (the two operations in question could be RR, RW, WR, or WW)
63 requireNoUCBypass exec order
UC-space operations do not exhibit read bypassing as in TSO
64 A MEMORY MODEL RULE IN HOL
requireCoherence exec order =
  !i j. i IN exec /\ j IN exec ==>
    isWr i /\ isWr j /\ (i.var = j.var) /\ order i j /\
    ((attr_of i.var = WB) \/ (attr_of i.var = UC)) /\
    ((i.wrType = Local) /\ (j.wrType = Local) /\ (i.proc = j.proc) \/
     (i.wrType = Remote) /\ (j.wrType = Remote) /\ (i.wrProc = j.wrProc))
    ==>
    !p q. p IN exec /\ q IN exec ==>
      isWr p /\ isWr q /\
      (p.wrID = i.wrID) /\ (q.wrID = j.wrID) /\
      (p.wrType = Remote) /\ (q.wrType = Remote) /\ (p.wrProc = q.wrProc)
      ==> order p q
65 One use we have put our Spec to: Post-Si Verification of MP Systems
How do we know that the actual silicon matches the shared memory model?
[Figure: silicon compared against the formal, quantified-logic specification]
- Pray
- Run tests and manually check results
- What else?
66 FORMALLY VERIFY interesting EXECUTIONS
P1's exec:
st8 12ca20 7f869af546f2f14c
ld8 r25 45180 <87b5e547172644a8>
ld2 r26 2c2a2c <44a8>
ld2 r27 45aa2a <c58e>
P2's exec:
st8 45180 87b5e547172644a8
ld8 r25 45180 <87b5e547172644a8>
st2 2c2a2c 44a8
st2 45aa2a c58e
67 TWO APPROACHES: explicitly QB, implicitly QB
[Figure: two flows starting from the SPEC OF MEMORY MODEL IN HOL]
- Explicit: given execution + spec -> BOOLIFY -> QBF (prototyped this, but definitely need to recode this)
- Implicit: spec -> CONVERT TO EXECUTION CHECKER PROGRAM -> program + given execution -> SAT PROBLEM
68 The alternative is to produce a manual proof
Even this simple litmus test has a 1-page detailed proof
P: st x = 1; mf; ld r1 = y <0>
Q: st.rel y = 1
R: ld.acq r2 = y <1>; ld r3 = x <0>
Proof ingredients:
- Atomicity of st.rel
- Load of the initial value is before the store of every other value
69 The MPEC Tool Flow
[Figure: tool flow]
- Inputs: the MP execution to be verified, and the Itanium ordering rules in HOL
- Mechanical program derivation (to be automated) turns the HOL rules into a checker program
- The checker program emits a satisfiability problem with clauses carrying annotations
- SAT solver:
  - Sat: explanation in the form of one possible interleaving
  - Unsat (RECENT WORK): unsat core extraction using Zcore
    - Find offending clauses
    - Trace their annotations
    - Determine ordering cycle
Example execution:
P: st x = 1; mf; ld r1 = y <0>
Q: st.rel y = 1
R: ld.acq r2 = y <1>; ld r3 = x <0>
70 Largest example tried to date (courtesy S. Zeisset, Intel)
Proc 1: st8 12ca20 7f869af546f2f14c; ld r25 45180 <87b5e547172644a8>; ... 58 more instructions ...; st2 7c2a00 4bca
Proc 2: ld4 r24 733a74 <415e304>; st4.rel 175984 96ab4e1f; ... 67 more instructions ...; ld8 r87 56460 <b5c113d7ce4783b1>
- Initially the tool gave a trivial violation
  - Diagnosed to be forgotten memory initialization
  - Added a method to incorporate memory initialization in our tool
- Our tool found the exact same cycle as pointed out by the author of the test
Cycle found thru our tool: st.rel (line 18, P1) -> ld (line 22, P2) -> mf -> ld (line 30, P2) -> st (line 11, P1)
71 Statistics Pertaining to Case Study
- 140 total instructions
- All runs were on a 1.733 GHz, 1 GB Athlon running Redhat Linux V9
- 1 minute to generate the SAT instance
  - 9M clauses (O(n^3) in terms of instructions)
  - 117,823 variables (not a problem)
- 1 minute to run SAT (unsat here); 0.2 sec to do real work
- Zcore runs fast; gave 23 clauses in one iteration
72 Overview of MPEC
- Example of how a HOL rule was turned into a SAT generator
- How the SAT part was done: throwing an efficient transitivity blanket over a problem to cover it with whatever transitivity it begs for!!
- What more to expect
- Related work
73 Gist of constraints
- Some arrangements are statically known
- Some are conditional: orderings combined with "implies" and "and"
- Some must form an atomic set: everybody else is strictly before or strictly after
- Find a strict total order satisfying all the above!
74 Gist of constraint ENCODING
- Use a Boolean precedence matrix (rows i = 1..N, columns j = 1..N)
- Capture "i before j" by m_ij
- Statically known orderings -> unit clauses
- "Implies" and "and" constraints -> Boolean formulas
- Atomic sets -> see how the SAT generator is derived
- Strict total order:
  - Spew out irreflexivity and totality axioms
  - Then throw a transitivity blanket on top of all tuples
75 Other Approaches Tried
- Small Domain method (n log n encoding)
  - Generates fantastically hard SAT problems!
  - Chokes many SAT solvers; ZchaffII can handle it well
- Incremental SAT (see CAV'04)
- QBF version: initial prototype needs lots of work
  - can serve to provide good QBF benchmarks
76 Approaches to transitivity blanket
Naive: for all tuples i, j, and k, generate m_ij /\ m_jk ==> m_ik
- Too many clauses (1B for a 1000-tuple program)
Better: obtain the transitive closure of known orderings and then prune irrelevant parts of the blanket
- E.g., if m_ij is known, don't generate the clauses of the form m_ij /\ ... ==> ... as well as ... /\ m_ij ==> ...
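The naive encoding is easy to write down and to sanity-check on a tiny instance. The sketch below uses one Boolean variable per precedence-matrix entry and emits clauses as lists of signed integers (DIMACS-style literals); the variable numbering and function names are assumptions made for illustration. Totality is emitted per unordered pair, and asymmetry follows from transitivity (with i = k) together with irreflexivity.

```python
from itertools import product

def var(i, j, n):
    """1-based Boolean variable number standing for m_ij ("i before j")."""
    return i * n + j + 1

def strict_total_order_cnf(n):
    """Return (irreflexivity, totality, transitivity) clause lists for n tuples."""
    irreflexive = [[-var(i, i, n)] for i in range(n)]
    total = [[var(i, j, n), var(j, i, n)]
             for i in range(n) for j in range(i + 1, n)]
    # Naive blanket: one clause (not m_ij or not m_jk or m_ik) for every i, j, k.
    transitive = [[-var(i, j, n), -var(j, k, n), var(i, k, n)]
                  for i, j, k in product(range(n), repeat=3)]
    return irreflexive, total, transitive

def count_models(n):
    """Brute-force count of satisfying assignments (only feasible for tiny n)."""
    clauses = [c for group in strict_total_order_cnf(n) for c in group]
    count = 0
    for bits in product([False, True], repeat=n * n):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses):
            count += 1
    return count
```

For n = 3 the encoding has exactly 3! = 6 models, one per strict total order, and the transitivity group alone has n^3 clauses, which is where the figure of about a billion clauses for 1000 tuples comes from.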
77 Obtaining SAT-generator from HOL
Initial Spec:
atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
       (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\ (i.wrID = k.wrID)
    /\ order(i,j) /\ order(j,k)
    ==> (j.wrID = i.wrID)
Applying Contrapositive:
atomicWBRelease(exec,order) =
  forall (i in exec).(j in exec).(k in exec).
       (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB) /\ (i.wrID = k.wrID)
    /\ ~(j.wrID = i.wrID)
    ==> ~(order(i,j) /\ order(j,k))
After Reducing Quantifier Scopes:
atomicWBRelease(exec,order) =
  forall (i in exec). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    ==> forall (k in exec). (i.wrID = k.wrID)
      ==> forall (j in exec). ~(j.wrID = i.wrID)
        ==> ~(order(i,j) /\ order(j,k))
78 Obtaining SAT-generator from HOL
Transformed Spec:
atomicWBRelease(exec,order) =
  forall (i in exec). (i.op = StRel) /\ (i.wrType = Remote) /\ (attr_of i.var = WB)
    ==> forall (k in exec). (i.wrID = k.wrID)
      ==> forall (j in exec). ~(j.wrID = i.wrID)
        ==> ~(order(i,j) /\ order(j,k))
Functional Program that generates the constraints (will be automated):
atomicWBRelease(exec) = forall(i, exec, wb(i))
wb(i) = if ~((attr_of i.var = WB) & (i.op = StRel) & (i.wrType = Remote))
        then true
        else forall(k, exec, wb1(i,k))
wb1(i,k) = if ~(i.wrID = k.wrID)
           then true
           else forall(j, exec, wb2(i,k,j))
wb2(i,k,j) = if (j.wrID = i.wrID)
             then true
             else ~(order(i,j) & order(j,k))
forall(i, S, e(i)) = for all i in S. e(i)
                   = foldr(&, map (fn i => e(i)) S, true)
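The same generator structure can be sketched in Python. This is a hypothetical rendering, not the tool's code: tuples are plain dictionaries, "true" is the empty list of constraints, conjunction is list concatenation (a left fold here, which gives the same conjunction as the foldr above), and each emitted constraint is a pair of precedence-matrix entries that must not both hold.

```python
from functools import reduce

TRUE = []  # the empty conjunction: no constraints generated

def forall(S, e):
    """Conjoin e(i) for every i in S by folding concatenation over a map."""
    return reduce(lambda acc, constraints: acc + constraints, map(e, S), TRUE)

def atomic_wb_release(execution):
    def wb(i):
        if not (i["attr"] == "WB" and i["op"] == "StRel" and i["wrType"] == "Remote"):
            return TRUE
        return forall(execution, lambda k: wb1(i, k))

    def wb1(i, k):
        if i["wrID"] != k["wrID"]:
            return TRUE
        return forall(execution, lambda j: wb2(i, k, j))

    def wb2(i, k, j):
        if j["wrID"] == i["wrID"]:
            return TRUE
        # encodes ~(order(i,j) /\ order(j,k)) as the two entries m_ij and m_jk
        return [((i["id"], j["id"]), (j["id"], k["id"]))]

    return forall(execution, wb)

# Two remote copies of one release store (wrID 0) and one unrelated load.
EXECUTION = [
    {"id": 0, "op": "StRel", "wrType": "Remote", "attr": "WB", "wrID": 0},
    {"id": 1, "op": "StRel", "wrType": "Remote", "attr": "WB", "wrID": 0},
    {"id": 2, "op": "Ld", "wrType": "DontCare", "attr": "WB", "wrID": -1},
]
```

For this execution the generator emits four constraints, each forbidding the load (id 2) from sitting between two copies of the release store. Each pair ((a, b), (c, d)) translates to the SAT clause (not m_ab or not m_cd).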
79 Clause annotations for the unsat core for the example
op1 11 op2 1 op3 1 op4 1 rule
ReadValue op1 11 op2 1 op3 1 op4 1
rule ReadValue op1 11 op2 1 op3 1
op4 1 rule ReadValue op1 11 op2 10
op3 1 op4 1 rule ReadValue op1 1
op2 1 op3 1 op4 1 rule NoRule op1
12 op2 1 op3 1 op4 1 rule
ReadValue op1 12 op2 1 op3 1 op4 1
rule ReadValue op1 12 op2 1 op3 1
op4 1 rule ReadValue op1 12 op2 1
op3 1 op4 1 rule ReadValue op1 12
op2 4 op3 1 op4 1 rule ReadValue op1
12 op2 1 op3 1 op4 1 rule
ReadValue op1 1 op2 1 op3 1 op4 1
rule NoRule op1 10 op2 12 op3 1 op4
1 rule AtomicWBRelease op1 10 op2 11
op3 1 op4 1 rule AtomicWBRelease op1
10 op2 11 op3 10 op4 1 rule
AtomicWBRelease op1 10 op2 11 op3 9 op4
1 rule AtomicWBRelease op1 10 op2 11
op3 8 op4 1 rule AtomicWBRelease op1
10 op2 11 op3 8 op4 1 rule
AtomicWBRelease op1 10 op2 11 op3 8 op4
1 rule AtomicWBRelease op1 10 op2 11
op3 8 op4 1 rule AtomicWBRelease
op1 1 op2 1 op3 1 op4 1 rule
Reflexive op1 4 op2 5 op3 6 op4 1
rule TransitiveOrder op1 4 op2 5 op3
1 op4 1 rule ProgramOrder op1 4 op2
6 op3 8 op4 1 rule TransitiveOrder op1
4 op2 11 op3 12 op4 1 rule
TransitiveOrder op1 5 op2 6 op3 1 op4
1 rule ProgramOrder op1 6 op2 8 op3
1 op4 1 rule TotalOrder op1 10 op2
11 op3 1 op4 1 rule TotalOrder op1
11 op2 4 op3 8 op4 1 rule
TransitiveOrder op1 11 op2 4 op3 1 op4
1 rule TotalOrder op1 11 op2 12 op3
1 op4 1 rule ProgramOrder op1 1 op2
1 op3 1 op4 1 rule NoRule op1 6
op2 1 op3 1 op4 1 rule
ReadValue op1 6 op2 1 op3 1 op4 1
rule ReadValue op1 6 op2 1 op3 1 op4
1 rule ReadValue op1 6 op2 1 op3
1 op4 1 rule ReadValue op1 6 op2 8
op3 1 op4 1 rule ReadValue op1 6 op2
1 op3 1 op4 1 rule ReadValue op1
1 op2 1 op3 1 op4 1 rule
NoRule op1 11 op2 1 op3 1 op4 1
rule ReadValue op1 11 op2 10 op3 1
op4 1 rule ReadValue
80 Building an Error-trail for UNSAT (infeasible executions)
Op numbers for the example (a store has both local and remote copies, so it has several op numbers):
1 2 3 4:   st x = 1
5:         mf
6:         ld r1 = y <0>
7 8 9 10:  st.rel y = 1
11:        ld.acq r2 = y <1>
12:        ld r3 = x <0>
81 Building an Error-trail
[Figure: op numbering as on slide 80, with the following clause annotation highlighted]
op1 4 op2 5 op3 1 op4 1 rule ProgramOrder
82 Building an Error-trail
[Figure: op numbering as on slide 80, with the following clause annotation highlighted]
op1 5 op2 6 op3 1 op4 1 rule ProgramOrder
83 Building an Error-trail
[Figure: op numbering as on slide 80, with the following clause annotations highlighted]
op1 6 op2 1 op3 1 op4 1 rule ReadValue
op1 6 op2 1 op3 1 op4 1 rule ReadValue
op1 6 op2 1 op3 1 op4 1 rule ReadValue
op1 6 op2 1 op3 1 op4 1 rule ReadValue
op1 6 op2 8 op3 1 op4 1 rule ReadValue
op1 6 op2 1 op3 1 op4 1 rule ReadValue
84 Building an Error-trail
[Figure: op numbering as on slide 80, with the following clause annotations highlighted]
op1 10 op2 12 op3 1 op4 1 rule AtomicWBRelease
op1 10 op2 11 op3 1 op4 1 rule AtomicWBRelease
op1 10 op2 11 op3 10 op4 1 rule AtomicWBRelease
op1 10 op2 11 op3 9 op4 1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 1 rule AtomicWBRelease
op1 10 op2 11 op3 8 op4 1 rule AtomicWBRelease
85 Building an Error-trail
[Figure: op numbering as on slide 80, with the following clause annotations highlighted]
op1 11 op2 1 op3 1 op4 1 rule ReadValue
op1 11 op2 10 op3 1 op4 1 rule ReadValue
op1 11 op2 1 op3 1 op4 1 rule ReadValue
op1 11 op2 1 op3 1 op4 1 rule ReadValue
op1 11 op2 1 op3 1 op4 1 rule ReadValue
op1 11 op2 10 op3 1 op4 1 rule ReadValue
86 Building an Error-trail
[Figure: op numbering as on slide 80, with the following clause annotation highlighted]
op1 11 op2 12 op3 1 op4 1 rule ProgramOrder
87 Building an Error-trail
[Figure: op numbering as on slide 80, with the following clause annotations highlighted]
op1 12 op2 1 op3 1 op4 1 rule ReadValue
op1 12 op2 1 op3 1 op4 1 rule ReadValue
op1 12 op2 1 op3 1 op4 1 rule ReadValue
op1 12 op2 1 op3 1 op4 1 rule ReadValue
op1 12 op2 4 op3 1 op4 1 rule ReadValue
op1 12 op2 1 op3 1 op4 1 rule ReadValue
88 MPEC (MP Execution Checker) Tool Demo
[Figure: tool pipeline]
- HOL rules for Itanium, in a HOL theory file
- Ganesh sitting down and coding: an MPECcable OCaml program
- Gentuple assembler -> SAT converter -> ZchaffII or other SAT solver -> SAT result
  - SAT: gives an interleaving
  - UNSAT: Zcore core extractor -> error explainer and DOT file generator -> GhostView: printout of the cycle revealing the error
89 Other Tools Developed in UV Group
- Yue (Jason) Yang's dissertation webpage
  - Itanium litmus-test checker in Constraint Prolog
  - NemosFinder: easily parameterizable litmus-checker suite in Constraint Prolog
  - UMM tool: easily parameterizable Murphi operational model for writing operational specs of memory models
  - DefectFinder: demo prototype of a memory-model-aware race analyzer
  - Now at MSR
  - (www.cs.utah.edu/yyang/); now jasony_at_microsoft.com
90 Part 3: What's not apparent at first glance
91 Topics
- Formal verification approaches to memory consistency compliance
- How to model the interface of the shared memory?
  - Execution based
  - IO mappings based
- What is wrong if an execution-based approach is chosen?
  - Finite-state realizability
- A transducer-based model of shared memory
  - Highlights of results
- Whither undecidability?
92 Formal Verification Approaches
[Figure: agreement between Imp of Shared Memory Consistency Model (a protocol) and Spec of Shared Memory Consistency Model]
- Several paper-and-pencil proofs
- Arons (PVS-based)
- McMillan (CTL model-checking based)
- Nalumasu et al. (Test Automata based)
- Qadeer (1. Finding a serializer. 2. Automated for simple write order)
- Bingham et al. (Window observer based)
93Other Formal Approaches
 Park, Dill, Nowatzyk
 Pong and Dubois (several papers)
 Collier's work
 Ghughal's adaptation of above for weak memory models
 Chatterjee (CAV02)
 Yu, Tuttle, Lamport
 Shen, Arvind
 Ahamad, Neiger
 (Check webpage of MPV00 www.cs.utah.edu/mpv )
 Steinke and Nutt
 Gibbons, Gharachorloo
 Adve, Pugh
 (a survey will take too long)
94Modeling the Interface of Shared Memory
[Figure: Spec and Imp boxes, drawn once with a single trace interface and once with separate input and output interfaces]
 Trace Based
  Most existing works
  Events: Read(proc, addr, data), Write(proc, addr, data), ...
 IO Mappings Based
  The original Lazy-caching paper (casual use)
  Kawash and Higham (defines Specs this way; Implementations not addressed)
  Sezgin et al. (defines Specs and Imps; Correspondence)
  Inputs: Read_i(proc, addr), Write_i(proc, addr, data), ...
  Outputs: Read_o(proc, addr, data), Write_o(proc, addr, data), ...
95What's wrong with trace-based approaches?
 Permits making statements about uninteresting or unrealizable machines
 Muddies exact import of the famous undecidability result (Alur et al.)
96Example 1: Finiteness cannot be adequately described through regular sets of executions alone
Consider the set of executions
 w(1,a,2) r(1,a,1)* r(2,a,2)* w(2,a,1)
which defines the TEMPORAL order of events. All these are considered SC because we can build a LOGICAL order
 w(1,a,2) r(2,a,2)* w(2,a,1) r(1,a,1)*
But how can the above TEMPORAL order be generated by an FSM?
 P1        P2
 w(a,2)    r(a,2)
 r(a,1)    r(a,2)
 r(a,1)    r(a,2)
 r(a,1)    w(a,1)
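The claim that these temporal orders are SC can be checked mechanically for small instances. The sketch below is a brute-force search, not a method from the cited work: it looks for a logical order that keeps each processor's program order and in which every read returns the latest write. The name `find_logical_order`, the tuple encoding of events, and the initial memory value 0 are assumptions made for this example.

```python
def find_logical_order(temporal, initial=0):
    """Return one logical order witnessing SC for `temporal`, or None.

    Events are tuples (kind, proc, addr, data) with kind "w" or "r".
    """
    per_proc = {}
    for ev in temporal:
        per_proc.setdefault(ev[1], []).append(ev)   # program order per processor
    procs = sorted(per_proc)
    dead_ends = set()                               # states already known to fail

    def search(pos, memory, order):
        if all(pos[i] == len(per_proc[p]) for i, p in enumerate(procs)):
            return order
        key = (pos, tuple(sorted(memory.items())))
        if key in dead_ends:
            return None
        for i, p in enumerate(procs):
            if pos[i] == len(per_proc[p]):
                continue
            ev = per_proc[p][pos[i]]
            kind, _, addr, data = ev
            if kind == "r" and memory.get(addr, initial) != data:
                continue                            # read would not see latest write
            new_memory = dict(memory)
            if kind == "w":
                new_memory[addr] = data
            nxt = pos[:i] + (pos[i] + 1,) + pos[i + 1:]
            found = search(nxt, new_memory, order + [ev])
            if found is not None:
                return found
        dead_ends.add(key)
        return None

    return search(tuple(0 for _ in procs), {}, [])


# One unravelling of the temporal order, with three reads per processor.
k = 3
temporal = ([("w", 1, "a", 2)] + [("r", 1, "a", 1)] * k
            + [("r", 2, "a", 2)] * k + [("w", 2, "a", 1)])
logical = find_logical_order(temporal)
print(logical)
```

The search succeeds for any k, which is exactly the point of the example: the trace-based definition accepts the execution even though P1's reads of 1 appear in the temporal order before the write of 1 has been issued.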
97Example 1 continued (take a specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^{2N} r(2,a,2)^{2N} w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^{2N} w(2,a,1) r(1,a,1)^{2N}
[Figure: three frames of an FSM Implementation of Seq Consistency with N Internal States]
 Frame 1. Program fed so far: w(1,a,2). Output generated so far: w(1,a,2)
 Frame 2. Program fed so far: w(1,a,2) r(1,a)^K, r(2,a)^L. Output generated so far: w(1,a,2)
 Frame 3. Program fed so far: w(1,a,2) r(1,a)^K, r(2,a)^L, but NO w(2,a,1). Output generated so far: w(1,a,2) r(1,a,1)
 FAIL! Output without input!!
98Example 1 continued (take a specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^{2N} r(2,a,2)^{2N} w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^{2N} w(2,a,1) r(1,a,1)^{2N}
[Figure: three frames of an FSM Implementation of Seq Consistency with N Internal States, now with inputs (wi, ri) and outputs (wo) distinguished]
 Frame 1. Program fed so far: wi(1,a,2). Output generated so far: wo(1,a,2)
 Frame 2. Program fed so far: wi(1,a,2) ri(1,a)^K, ri(2,a)^L. Output generated so far: wo(1,a,2)
 Frame 3. Program fed so far: wi(1,a,2) ri(1,a)^K, ri(2,a)^L wi(2,a,1). Output generated so far: wo(1,a,2)
 FAIL! Too many inputs without output
99Example 1 continued (take a specific unravelling of *)
Temporal Order: w(1,a,2) r(1,a,1)^{2N} r(2,a,2)^{2N} w(2,a,1)
Logical Order: w(1,a,2) r(2,a,2)^{2N} w(2,a,1) r(1,a,1)^{2N}
[Figure: an FSM Implementation of Seq Consistency with N Internal States]
 Program fed so far: wi(1,a,2) ri(1,a)^K, ri(2,a)^L wi(2,a,1). Output generated so far: wo(1,a,2)
 FAIL! Too many inputs without output
[Figure: state diagram with labels wi(1,a,2), wi(1,a,2), and a loop labeled by ri(1,a)^K, ri(2,a)^L]
We can pump this loop, thus making it possible to generate the SAME execution for arbitrarily long programs!!
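The pumping argument rests on the pigeonhole principle: a machine with N states that consumes more than N inputs without producing output must revisit a state. A minimal sketch of that observation follows, assuming a deterministic machine given by a transition function; `find_pumpable_loop` and the three-state toy machine are hypothetical and stand in for no particular protocol.

```python
def find_pumpable_loop(step, start, inputs):
    """Return (i, j), i < j, such that the machine is in the same state after
    consuming i and j inputs, or None if no state repeats.

    `step(state, symbol)` is the transition function of a deterministic
    machine that is assumed to stay silent (no output) on these inputs.
    """
    first_seen = {start: 0}
    state = start
    for n, symbol in enumerate(inputs, start=1):
        state = step(state, symbol)
        if state in first_seen:
            return first_seen[state], n     # inputs i+1..j label a loop
        first_seen[state] = n
    return None


# Toy machine with N = 3 states that merely counts pending reads modulo 3.
def toy_step(state, symbol):
    return (state + 1) % 3

loop = find_pumpable_loop(toy_step, 0, ["ri(1,a)"] * 4)
print(loop)
```

The inputs between positions i and j can be repeated any number of times without the machine noticing, which is why a finite-state machine cannot produce the required number of read outputs for arbitrarily long programs.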
100Restrictions in contemporary work that enable SC verification
 Bingham, Condon, Hu
  Require Prefix Closure (no outputs w/o input), e.g. the trace of length 1: r(1,a,1)
  Rule out Prophetic Inheritance, i.e. Temporal Orders of the form
   w(1,a,2) r(1,a,1)^{2N} r(2,a,2)^{2N} w(2,a,1)
101Restrictions in contemporary work that enable SC verification
 Qadeer
  Requires Simple Write Ordering: the order of the writes to the same address in the temporal order and the logical order must be the same
  (But they provide an automated model-checking based verification method for this class of SC protocols)
Temporal Order: w(1,a,1) w(2,a,2) r(3,a,2) r(4,a,1)
Required Logical Order: w(2,a,2) r(3,a,2) w(1,a,1) r(4,a,1)
(The two writes to a are swapped between the two orders, so this execution does not satisfy Simple Write Ordering.)
<diagram of Lazy Caching here>
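The Simple Write Ordering condition can be stated as a small check: project both orders onto the writes to each address and compare. The sketch below assumes events are encoded as tuples (kind, proc, addr, data); the function names are made up for this example and are not taken from Qadeer's work.

```python
def write_order(trace):
    """Map each address to the sequence of write events to it, in trace order."""
    orders = {}
    for ev in trace:
        kind, _proc, addr, _data = ev
        if kind == "w":
            orders.setdefault(addr, []).append(ev)
    return orders


def respects_simple_write_order(temporal, logical):
    """True if writes to each address occur in the same order in both traces."""
    return write_order(temporal) == write_order(logical)


# The example from the slide.
temporal = [("w", 1, "a", 1), ("w", 2, "a", 2), ("r", 3, "a", 2), ("r", 4, "a", 1)]
logical = [("w", 2, "a", 2), ("r", 3, "a", 2), ("w", 1, "a", 1), ("r", 4, "a", 1)]
print(respects_simple_write_order(temporal, logical))
```

For the slide's example the check fails, because the logical order needs w(2,a,2) before w(1,a,1) while the temporal order has them the other way round.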
102Taxonomy of formal SC modeling approaches
 Alur et al.
  Not Necessarily Prefix Closed (NNPC) regular traces model the SC language
  Checking containment of the (regular) language of the Implementation is undecidable
 Bingham, Condon, and Hu
  DSC trace set (Decisive Sequential Consistency)
 Sezgin's work
  Models memory systems using regular transducers
  Defines EXACTLY what finite-state realizable SC systems are
  SC verification is language containment
  Provides a semi-decision procedure for SC verification in this setting
103Example 2 (Sezgin): The dangers of trace-based modeling
 Imagine a memory system implementation that does this
  Accept reads and writes
  If the first |P| x |A| instructions are writes, and further these contain exactly one write by each processor to each address
   THEN go into malevolent mode (disconnect the shared memory)
   ELSE go into benevolent mode (behave like serial memory)
[Figure: processors P1, P2, ..., Pn. Benevolent Mode Connections lead to a Single Serial Memory Unit M; Malevolent Mode Connections lead to separate memories M1, M2, ..., Mn]
104Example 2 (Sezgin)
Example: P = {1,2,3}, A = {a}, and D = {0,1,2}
Benevolent Mode from now on, since the second instruction is a read:
 w(1,a,2) r(3,a,2) w(2,a,1) r(1,a,1)
Malevolent Mode from now on, as we have |P| x |A| writes:
 w(1,a,1) w(3,a,2) w(2,a,0) r(1,a,1) r(2,a,0) r(3,a,2) w(1,a,2) w(2,a,1) r(1,a,2) r(2,a,1) r(3,a,2)
LOGICAL ORDER
 w(1,a,1) r(1,a,1) w(1,a,2) r(1,a,2) w(2,a,0) r(2,a,0) w(2,a,1) r(2,a,1) w(3,a,2) r(3,a,2) r(3,a,2)
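The machine of Example 2 is small enough to simulate. The sketch below is one possible reading of the description, under stated assumptions: memory starts at 0, and while the mode is still undecided each write is applied both to the serial memory and to the writer's private copy (the slides do not say how the machine buffers these writes). The function name and the tuple encoding of instructions are made up for this example.

```python
def run_example2_memory(program, procs, addrs, initial=0):
    """Simulate the benevolent/malevolent memory and return (mode, trace).

    Instructions are ("w", proc, addr, data) or ("r", proc, addr).
    """
    serial = {a: initial for a in addrs}                        # single unit M
    private = {p: {a: initial for a in addrs} for p in procs}   # M1..Mn
    first_writes = set()
    mode = None                                                 # undecided
    trace = []
    for op in program:
        kind, proc, addr = op[0], op[1], op[2]
        if mode is None:
            if kind == "w" and (proc, addr) not in first_writes:
                first_writes.add((proc, addr))
                if len(first_writes) == len(procs) * len(addrs):
                    mode = "malevolent"     # one write per processor per address
            else:
                mode = "benevolent"         # a read or a repeated write came first
        if kind == "w":
            data = op[3]
            if mode != "malevolent":
                serial[addr] = data
            if mode != "benevolent":
                private[proc][addr] = data
            trace.append(("w", proc, addr, data))
        else:
            source = serial if mode == "benevolent" else private[proc]
            trace.append(("r", proc, addr, source[addr]))
    return mode, trace


procs, addrs = [1, 2, 3], ["a"]
benevolent_program = [("w", 1, "a", 2), ("r", 3, "a"), ("w", 2, "a", 1), ("r", 1, "a")]
malevolent_program = [
    ("w", 1, "a", 1), ("w", 3, "a", 2), ("w", 2, "a", 0),
    ("r", 1, "a"), ("r", 2, "a"), ("r", 3, "a"),
    ("w", 1, "a", 2), ("w", 2, "a", 1),
    ("r", 1, "a"), ("r", 2, "a"), ("r", 3, "a"),
]
print(run_example2_memory(benevolent_program, procs, addrs))
print(run_example2_memory(malevolent_program, procs, addrs))
```

Under these assumptions the two runs reproduce the read values of the benevolent and malevolent traces shown on the slide: in malevolent mode each processor only ever sees its own writes.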
105Whoa? Any Logical Order will do?!
TEMPORAL ORDER
 w(1,a,1) w(3,a,2) w(2,a,0) r(1,a,1) r(2,a,0) r(3,a,2) w(1,a,2) w(2,a,1) r(1,a,2) r(2,a,1) r(3,a,2)
LOGICAL ORDER
 w(1,a,1) r(1,a,1) w(1,a,2) r(1,a,2) w(2,a,0) r(2,a,0) w(2,a,1) r(2,a,1) w(3,a,2) r(3,a,2) r(3,a,2)
 A Logical Order had better not be fiction: it should be a possible schedule in a "could have happened" sense
 Viewed from that angle, the above logical order is nonsense because it allows certain actions to be postponed unboundedly
 Sezgin's formal definition of Implementations builds in boundedness
 BCH address an instance of this in their past-time SC idea
 Sezgin's SC machines give logical order out as Commit Order
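That the trace-based definition really does accept this logical order can be confirmed with a witness check. The sketch below tests only the two classical conditions (program order preserved, every read sees the latest write) and deliberately has no notion of boundedness. The function name, the tuple encoding of events, and the initial value 0 are assumptions made for this example.

```python
from collections import defaultdict


def is_valid_logical_order(temporal, logical, initial=0):
    """Check that `logical` is a serial witness for `temporal`.

    Events are tuples (kind, proc, addr, data) with kind "w" or "r".
    """
    def by_proc(trace):
        events = defaultdict(list)
        for ev in trace:
            events[ev[1]].append(ev)
        return dict(events)

    # Same events, and each processor's program order is preserved.
    if by_proc(temporal) != by_proc(logical):
        return False

    # Every read returns the most recent write to its address.
    memory = {}
    for kind, _proc, addr, data in logical:
        if kind == "w":
            memory[addr] = data
        elif memory.get(addr, initial) != data:
            return False
    return True


temporal = [
    ("w", 1, "a", 1), ("w", 3, "a", 2), ("w", 2, "a", 0),
    ("r", 1, "a", 1), ("r", 2, "a", 0), ("r", 3, "a", 2),
    ("w", 1, "a", 2), ("w", 2, "a", 1),
    ("r", 1, "a", 2), ("r", 2, "a", 1), ("r", 3, "a", 2),
]
logical = [
    ("w", 1, "a", 1), ("r", 1, "a", 1), ("w", 1, "a", 2), ("r", 1, "a", 2),
    ("w", 2, "a", 0), ("r", 2, "a", 0), ("w", 2, "a", 1), ("r", 2, "a", 1),
    ("w", 3, "a", 2), ("r", 3, "a", 2), ("r", 3, "a", 2),
]
print(is_valid_logical_order(temporal, logical))
```

The check passes even though the logical order runs each processor to completion in turn, postponing the writes of processors 2 and 3 past everything processor 1 ever does. A definition that demands bounded postponement would reject it.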
106Status of SC undecidability
 Alur et al.: UNDECIDABLE under NNPC
  NNPC is unrealistic
 Qadeer: Decidable under simple write order
  Simple Write Order rules out some protocols
 Bingham, Condon, and Hu: Decidable under simple write order; also in DSC_k
  These don't capture exactly those that are FS realizable
 Sezgin's work: Decidability open
  Captures exactly the class of FS realizable protocols in a detailed manner (Input or programs explicitly modeled)
107Concluding Remarks
 Importance of topic unlikely to diminish
 Platform compliance is a big deal
 High-performance OS kernel writers need to know
 Think of proving a distributed Garbage Collector running on a Weak Memory Model (would be a great PhD topic)
 I've omitted too many important names I can't even remember
 Partial list: Adve, Gharachorloo, Pugh, Arvind, Collier, ...
108Acknowledgements (sorry for omissions)
 Past students / postdoc: Nalumasu, Ghughal, Mokkedem, Hosabettu, Jones, Sivaraj, Yang, Yang, Kuramkote
 Faculty colleagues: Lindstrom, Slind, Carter
 Funding agencies: NSF, SRC
 Industrial Liaisons: Corella, Chou, German, Vaid, Neiger, Zeisset, Park
 Other favorable influences: Mathews, Tuttle, Yu, Joshi, Dill, Pong, Nowatzyk, Lamport, Hu, Condon, Higham, Kawash, Jackson
 Who am I forgetting?