Title: Parallel Programming Models: Shared Memory Programming, Intro to Message Passing, and Shared Objects
1. Parallel Programming Models: Shared Memory Programming, Intro to Message Passing, and Shared Objects Programming in Charm++
2. Writing parallel programs
- Programming model
  - How should a programmer view the parallel machine?
  - Sequential programming: the von Neumann model
- Parallel programming models
  - Shared memory (shared address space) model
  - Message passing model
  - Shared objects model
3. Shared Address Space Model
- All memory is accessible to all processes
- Processes are mapped to processors, typically by a symmetric OS
- Coordination among processes
  - by sharing variables
- Avoid stepping on toes
  - using locks and barriers
4. Matrix multiplication

  for (i=0; i<M; i++)
    for (j=0; j<N; j++)
      for (k=0; k<N; k++)
        C[i][j] += A[i][k] * B[k][j];

In a shared memory style, this program is trivial to parallelize: just have each processor deal with a different range of i (or j? or both?).
5. Programming Models (2)
- Basics of shared address space programming and message passing
6. Shared Address Space Model
- All memory is accessible to all processes
- Processes are mapped to processors, typically by a symmetric OS
- Coordination among processes
  - by sharing variables
- Avoid stepping on toes
  - using locks and barriers
7. Matrix multiplication

  for (i=0; i<M; i++)
    for (j=0; j<N; j++)
      for (k=0; k<L; k++)
        C[i][j] += A[i][k] * B[k][j];

In a shared memory style, this program is trivial to parallelize: just have each processor deal with a different range of i (or j? or both?).
8. SAS version pseudocode

  size = M/numPEs();
  myStart = myPE() * size;
  for (i=myStart; i<myStart+size; i++)
    for (j=0; j<N; j++)
      for (k=0; k<L; k++)
        C[i][j] += A[i][k] * B[k][j];
9. Running example: computing π
- Area of circle = πr²
- Ratio of the area of a circle to that of the enclosing square = π/4
- Method: compute a set of random number pairs (in the range 0-1) and count the number of pairs that fall inside the circle
  - The ratio gives us an estimate for π/4
- In parallel: let each processor compute a different set of random number pairs (in the range 0-1) and count the number of pairs that fall inside the circle
10. Pi on shared memory

  int count;
  Lock countLock;

  piFunction(int myProcessor) {
    seed s = makeSeed(myProcessor);
    for (i=0; i<100000/P; i++) {
      x = random(s); y = random(s);
      if (x*x + y*y < 1.0) {
        lock(countLock);
        count++;
        unlock(countLock);
      }
    }
    barrier();
    if (myProcessor == 0)
      printf("pi=%f\n", 4.0*count/100000);
  }
11.

  main() {
    countLock = createLock();
    parallel(piFunction);
  }

The system needs to provide the functions for locks, barriers, and thread (or process) creation.
12. Pi on shared memory: efficient version

  int count;
  Lock countLock;

  piFunction(int myProcessor) {
    int c = 0;
    seed s = makeSeed(myProcessor);
    for (i=0; i<100000/P; i++) {
      x = random(s); y = random(s);
      if (x*x + y*y < 1.0) c++;
    }
    lock(countLock);
    count += c;
    unlock(countLock);
    barrier();
    if (myProcessor == 0)
      printf("pi=%f\n", 4.0*count/100000);
  }
13. Real SAS systems
- POSIX threads (Pthreads) is a standard for threads-based shared memory programming
- Shared memory calls: just a few, normally standard calls
- In addition, lower level calls: fetch-and-inc, fetch-and-add
14. Message Passing
- Assume that processors have direct access to only their own memory
- Each processor typically executes the same executable, but may be running a different part of the program at a time
15. Message passing basics
- Basic calls: send and recv
  - send(int proc, int tag, int size, char *buf)
  - recv(int proc, int tag, int size, char *buf)
  - recv may return the actual number of bytes received in some systems
- tag and proc may be wildcarded in a recv
  - recv(ANY, ANY, 1000, buf)
- broadcast
- Other global operations (reductions)
16. POSIX Threads on Origin 2000
- Shared memory programming on Origin 2000: important calls
- Thread creation and joining
  - pthread_create(pthread_t *threadID, pthread_attr_t *attr, void *(*functionName)(void *), void *arg)
  - pthread_join(pthread_t threadID, void **result)
- Locks
  - pthread_mutex_t lock;
  - pthread_mutex_lock(&lock);
  - pthread_mutex_unlock(&lock);
- Condition variables
  - pthread_cond_t cv;
  - pthread_cond_init(&cv, (pthread_condattr_t *) 0);
  - pthread_cond_wait(&cv, &cv_mutex);
  - pthread_cond_broadcast(&cv);
- Semaphores, and other calls
17. Declarations

  /* pgm.c */
  #include <pthread.h>
  #include <stdlib.h>
  #include <stdio.h>

  #define nThreads 4
  #define nSamples 1000000

  typedef struct _shared_value {
    pthread_mutex_t lock;
    int value;
  } shared_value;

  shared_value sval;
18. Function in each thread

  void *doWork(void *id)
  {
    size_t tid = (size_t) id;
    int nsucc, ntrials, i;
    ntrials = nSamples/nThreads;
    nsucc = 0;
    srand48((long) tid);
    for (i=0; i<ntrials; i++) {
      double x = drand48();
      double y = drand48();
      if ((x*x + y*y) < 1.0) nsucc++;
    }
    pthread_mutex_lock(&(sval.lock));
    sval.value += nsucc;
    pthread_mutex_unlock(&(sval.lock));
    return 0;
  }
19. Main function

  int main(int argc, char *argv[])
  {
    pthread_t tids[nThreads];
    size_t i;
    double est;
    pthread_mutex_init(&(sval.lock), NULL);
    sval.value = 0;
    printf("Creating Threads\n");
    for (i=0; i<nThreads; i++)
      pthread_create(&tids[i], NULL, doWork, (void *) i);
    printf("Created Threads... waiting for them to complete\n");
    for (i=0; i<nThreads; i++)
      pthread_join(tids[i], NULL);
    printf("Threads Completed...\n");
    est = 4.0 * ((double) sval.value / (double) nSamples);
    printf("Estimated Value of PI = %lf\n", est);
    exit(0);
  }
20. Compiling: Makefile

  # Makefile
  # for solaris: FLAGS = -mt
  # for Origin2000: FLAGS =

  pgm: pgm.c
  	cc -o pgm $(FLAGS) pgm.c -lpthread

  clean:
  	rm -f pgm *.o
21. Message Passing
- Program consists of independent processes,
  - each running in its own address space
- Processors have direct access to only their own memory
- Each processor typically executes the same executable, but may be running a different part of the program at a time
- Special primitives exchange data: send/receive
- Early theoretical systems
  - CSP: communicating sequential processes
  - send and matching receive from another processor: both wait
  - OCCAM on Transputers used this model
  - Performance problems due to unnecessary(?) wait
- Current systems
  - Send operations don't wait for receipt on the remote processor
22. Message Passing

[Figure: PE0 executes a send and PE1 a receive; the data is copied from PE0's buffer to PE1's buffer.]
23. Basic Message Passing
- We will describe a hypothetical message passing system,
  - with just a few calls that define the model
- Later, we will look at real message passing models (e.g. MPI), with a more complex set of calls
- Basic calls
  - send(int proc, int tag, int size, char *buf)
  - recv(int proc, int tag, int size, char *buf)
  - recv may return the actual number of bytes received in some systems
- tag and proc may be wildcarded in a recv
  - recv(ANY, ANY, 1000, buf)
- broadcast
- Other global operations (reductions)
24. Pi with message passing

  int count, c;

  main() {
    seed s = makeSeed(myProcessor);
    for (i=0; i<100000/P; i++) {
      x = random(s); y = random(s);
      if (x*x + y*y < 1.0) count++;
    }
    send(0, 1, 4, &count);
25. Pi with message passing

    if (myProcessorNum() == 0) {
      for (i=0; i<maxProcessors(); i++) {
        recv(i, 1, 4, &c);
        count += c;
      }
      printf("pi=%f\n", 4.0*count/100000);
    }
  } /* end function main */
26. Collective calls
- Message passing is often, but not always, used for the SPMD style of programming
  - SPMD: Single Program Multiple Data
  - All processors execute essentially the same program, and the same steps, but not in lockstep
- All communication is almost in lockstep
- Collective calls
  - global reductions (such as max or sum)
  - syncBroadcast (often just called broadcast)
    - syncBroadcast(whoAmI, dataSize, dataBuffer)
    - whoAmI: sender or receiver
27. Standardization of message passing
- Historically
  - nxlib (on Intel hypercubes)
  - nCUBE variants
  - PVM
  - Everyone had their own variants
- MPI standard
  - Vendors, ISVs, and academics got together
    - with the intent of standardizing current practice
  - Ended up with a large standard
  - Popular, due to vendor support
  - Support for
    - communicators: avoiding tag conflicts, ...
    - data types
    - ...
28. Parallel programming tasks
- Decomposition (what to do in parallel)
- Mapping
- Scheduling (sequencing)
- Machine dependent expression
29. Spectrum of parallel languages

[Figure: parallel languages arranged by level of abstraction versus degree of specialization; MPI is marked on the spectrum.]
30. Charm++
- Data driven objects
- Asynchronous method invocation
- Prioritized scheduling
- Object arrays
- Object groups
  - a global object with a representative on each PE
- Information sharing abstractions
31. Data Driven Execution

[Figure: each processor runs a scheduler with its own message queue; the scheduler picks the next message from the queue and invokes the corresponding method on the target object.]
32.

  CkChareID mainhandle;

  main::main(CkArgMsg *m)     // execution begins here; m carries argc/argv
  {
    int i, low = 0;
    for (i=0; i<100; i++)
      new CProxy_piPart();
    responders = 100;
    count = 0;
    mainhandle = thishandle;  // readonly initialization
  }                           // scheduler resumes after the method returns

  void main::results(DataMsg *msg)
  {
    count += msg->count;
    if (0 == --responders) {
      CkPrintf("pi = %f \n", 4.0*count/100000);
      CkExit();               // exit the scheduler
    }
  }
33.

  piPart::piPart()
  {
    // declarations..
    CProxy_main mainproxy(mainhandle);
    srand48((long) this);
    mySamples = 100000/100;
    for (i=0; i<mySamples; i++) {
      x = drand48();
      y = drand48();
      if ((x*x + y*y) < 1.0) localCount++;
    }
    DataMsg *result = new DataMsg;
    result->count = localCount;
    mainproxy.results(result);
    delete this;
  }
34. Chares (data driven objects)
- Regular C++ classes,
  - with some methods designated as remotely invokable (called entry methods)
- Entry methods have only one parameter
  - of type message
- Creation of an instance of chare class C:
  - new CProxy_C(msg);
- Creates an instance of C on a specified processor pe:
  - new CProxy_C(msg, pe);
- CProxy_C: a proxy class generated by Charm++ for the chare class C declared by the user
35. Messages
- A user-defined C++ class
  - inherits from a system-defined class
- Messages can be communicated to others as parameters
- Has regular data fields
- Declaration: normal C++,
  - inherit from a system defined class
- Creation (just usual C++):
  - MsgType *m = new MsgType;
36. Remote method invocation
- Proxy classes
  - For each chare class C, the system generates a proxy class (C : CProxy_C)
- Each chare has a global ID (ChareID)
  - Global in the sense of being valid on all processors
  - thishandle (analogous to this) gets you the ChareID
  - You can send thishandle in messages
- Given a handle h, you can create a proxy
  - CProxy_C p(h); // or q = new CProxy_C(h);
  - p.method(msg); // or q->method(msg);
37. Object Arrays
- A collection of chares,
  - with a single global name for the collection, and
  - each member addressed by an index
- Mapping of element objects to processors handled by the system

[Figure: in the user's view, a single array A[0], A[1], A[2], A[3], ...; in the system's view, the elements are distributed across processors, e.g. A[0] and A[3] on one PE.]
38. Object Groups
- A group of objects (chares)
  - with exactly one representative on each processor
- A single ID for the group as a whole
- Invoke methods in a branch (asynchronously), all branches (broadcast), or in the local branch
- Creation:
  - groupId = new CProxy_C(msg);
- Remote invocation:
  - CProxy_C p(groupId);
  - p.methodName(msg); // p.methodName(msg, peNum);
  - p.LocalBranch->f(...);
39. Information sharing abstractions
- Observation
  - Information is shared in several specific modes in parallel programs
- Other models support only a limited set of modes
  - Shared memory: everything is shared: a sledgehammer approach
  - Message passing: messages are the only method
- Charm++ identifies and supports several modes
  - Readonly / writeonce
  - Tables (hash tables)
  - Accumulators
  - Monotonic variables
40. Compiling Charm++ programs
- Need to define an interface specification file
  - mod.ci for each module mod
- Contains declarations that the system uses to produce proxy classes
- These produced classes must be included in your mod.C file
- See examples provided on the class web site.
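As an illustration of what such an interface file might look like for the pi example in these slides, here is a sketch of a possible pgm.ci; the module name and the exact form of the declarations are our guesses, not taken from the course materials:

```
mainmodule pgm {
  readonly CkChareID mainhandle;

  message DataMsg;

  mainchare main {
    entry main(CkArgMsg *m);
    entry void results(DataMsg *msg);
  };

  chare piPart {
    entry piPart(void);
  };
};
```

Running the Charm++ translator on a file like this generates the CProxy_main and CProxy_piPart classes used on the earlier slides, which you then include in your .C file.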