Title: Windows-NT based Distributed Virtual Parallel Machine
1. Windows-NT based Distributed Virtual Parallel Machine
The MILLIPEDE Project, Technion, Israel
http://www.cs.technion.ac.il/Labs/Millipede
2. What is Millipede?
A strong Virtual Parallel Machine employing non-dedicated distributed environments.
Stack: Programs / Implementations of Parallel Programming Languages / Distributed Environment
3. Programming Paradigms
- Paradigms: SPLASH, Cilk/Calipso, CC, Java, CParPar, ParC, ParFortran90, Bare Millipede, other
- Millipede Layer: Events Mechanism (MJEC), Migration Services (MGS), Distributed Shared Memory (DSM)
- Communication Packages: U-Net, Transis, Horus, ...
- Operating System Services: Communication, Threads, Page Protection, I/O
- Software Packages: User-mode threads
4. So, what's in a VPM? Checklist:
- Uses a non-dedicated cluster of PCs (+ SMPs)
- Multi-threaded
- Shared memory
- User-mode
- Strong support for weak memory
- Dynamic page- and job-migration
- Load sharing for maximal locality of reference
- Convergence to optimal level of parallelism
(Millipede inside)
5. Using a Non-Dedicated Cluster
- Dynamically identify idle machines
- Move work to idle machines
- Evacuate busy machines
- Do everything transparently to the native user
- Co-existence of several parallel applications
6. Multi-Threaded Environments
- Well known
- Better utilization of resources
- An intuitive and high level of abstraction
- Latency hiding by overlapping computation and communication
- Natural for parallel programming paradigms and environments
- Programmer-defined maximal level of parallelism
- Actual level of parallelism set dynamically; applications scale up and down
- Nested parallelism
- SMPs
7. Convergence to Optimal Speedup
- The tradeoff: a higher level of parallelism vs. better locality of memory reference
- Optimal speedup is not necessarily reached with the maximal number of computers
- The achieved level of parallelism depends on the program's needs and on the capabilities of the system
8. No/Explicit/Implicit Access to Shared Memory

PVM (explicit message passing):

    /* Receive data from master */
    msgtype = 0;
    pvm_recv(-1, msgtype);
    pvm_upkint(&nproc, 1, 1);
    pvm_upkint(tids, nproc, 1);
    pvm_upkint(&n, 1, 1);
    pvm_upkfloat(data, n, 1);

    /* Determine which slave I am (0..nproc-1) */
    for (i = 0; i < nproc; i++)
        if (mytid == tids[i]) { me = i; break; }

    /* Do calculations with data */
    result = work(me, n, data, tids, nproc);

    /* Send result to master */
    pvm_initsend(PvmDataDefault);
    pvm_pkint(&me, 1, 1);
    pvm_pkfloat(&result, 1, 1);
    msgtype = 5;
    master = pvm_parent();
    pvm_send(master, msgtype);

    /* Exit PVM before stopping */
    pvm_exit();

C-Linda (explicit DSM access):

    /* Retrieve data from DSM */
    rd("init data", ?nproc, ?n, ?data);

    /* Do calculation, put result in DSM */
    out("result", id, work(id, n, data, nproc));

Bare Millipede (implicit shared memory):

    /* Worker id is given at creation, no need to compute it now */
    result = work(milGetMyId(), n, data, milGetTotalIds());
9. Relaxed Consistency (avoiding false sharing and ping-pong)
- Protocols: Sequential, CRUW, Sync(var), Arbitrary-CW Sync
- Multiple relaxations for different shared variables within the same program
- No broadcast, no central address servers (so it can work efficiently over interconnected LANs)
- New protocols welcome (user defined?!)
- Step-by-step optimization towards maximal parallelism
(Figure: page copies)
10. LU Decomposition, 1024x1024 matrix, written in SPLASH
- Advantages gained when reducing the consistency of a single variable (the Global structure)
11. MJEC - Millipede Job Event Control
An open mechanism with which various synchronization methods can be implemented:
- A job has a unique system-wide id
- Jobs communicate and synchronize by sending events
- Although a job is mobile, its events follow it and reach its event queue wherever it goes
- Event handlers are context-sensitive
12. MJEC (cont.)
Modes:
- In Execution Mode, arriving events are enqueued
- In Dispatching Mode, events are dequeued and handled by a user-supplied dispatching routine
13. MJEC Interface
Flow: starting from Execution Mode, milEnterDispatchingMode(func, context) first calls
ret = func(INIT, context); then, while ret is not EXIT, it waits for a pending event and
calls ret = func(event, context); finally it calls func(EXIT, context) and returns.

- Registration and entering dispatch mode:
  milEnterDispatchingMode((FUNC)foo, void *context)
- Post event:
  milPostEvent(id target, int event, int data)
- Dispatcher routine syntax:
  int foo(id origin, int event, int data, void *context)
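The control flow above can be sketched in plain C. This is a simulation only: the event queue, the `enter_dispatching_mode` loop, and the INIT/EXIT codes are local stand-ins for the Millipede machinery, not the real API.

```c
#include <assert.h>

#define EVT_INIT  -1
#define EVT_EXIT  -2
#define STAY_IN_DISPATCHER 0
#define EXIT_DISPATCHER    1

/* Toy event queue standing in for a job's MJEC queue. */
static int queue[16];
static int q_head, q_tail;
static void post_event(int e) { queue[q_tail++] = e; }
static int  pending(void)     { return q_head < q_tail; }
static int  next_event(void)  { return queue[q_head++]; }

typedef int (*dispatch_fn)(int event, void *context);

/* Mirrors the flowchart: an INIT call, then dequeue events until the
 * dispatcher returns EXIT_DISPATCHER, then a final EXIT call. */
static void enter_dispatching_mode(dispatch_fn func, void *context) {
    int ret = func(EVT_INIT, context);
    while (ret != EXIT_DISPATCHER) {
        if (!pending()) break;      /* the real MJEC would block here */
        ret = func(next_event(), context);
    }
    func(EVT_EXIT, context);
}

/* Example dispatcher: count events, leave when event 99 arrives. */
static int counting_dispatcher(int event, void *context) {
    int *count = context;
    if (event == EVT_INIT || event == EVT_EXIT) return STAY_IN_DISPATCHER;
    (*count)++;
    return event == 99 ? EXIT_DISPATCHER : STAY_IN_DISPATCHER;
}
```

Note how events that arrive after the dispatcher exits simply stay queued, matching the Execution-Mode rule on the previous slide.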
14. Experience with MJEC
- ParC: 250 lines; SPLASH: 120 lines
- Easy implementation of many synchronization methods: semaphores, locks, condition variables, barriers
- Implementation of location-dependent services (e.g., a graphical display)
15. Example - Barriers with MJEC

    Barrier()
    {
        milPostEvent(BARSERV, ARR, 0);
        milEnterDispatchingMode(wait_in_barrier, 0);
    }

    wait_in_barrier(src, event, context)
    {
        if (event == DEP)
            return EXIT_DISPATCHER;
        else
            return STAY_IN_DISPATCHER;
    }

(Figure: jobs call BARRIER(...), post ARR events to the barrier server, and wait in their dispatchers.)
16. Example - Barriers with MJEC (cont.)

    BarrierServer()
    {
        milEnterDispatchingMode(barrier_server, info);
    }

    barrier_server(src, event, context)
    {
        if (event == ARR)
            enqueue(context.queue, src);
        if (should_release(context))
            while (context.cnt > 0)
                milPostEvent(context.dequeue, DEP, 0);
        return STAY_IN_DISPATCHER;
    }

(Figure: the barrier server's dispatcher posts DEP events to release the waiting jobs.)
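The two slides above can be condensed into a runnable toy model. Everything here is a stand-in: the direct function call replaces event delivery, and `released[]` replaces the DEP events that the real server would post with milPostEvent.

```c
#include <assert.h>

#define NJOBS 3

enum { ARR, DEP };

typedef struct {
    int queue[NJOBS];   /* ids of jobs waiting at the barrier */
    int cnt;
} barrier_info;

static int released[NJOBS];   /* released[j] = 1 once job j "gets" DEP */

/* Server side: enqueue arrivals; release everyone once all arrived. */
static void server_handle(barrier_info *b, int src) {
    b->queue[b->cnt++] = src;
    if (b->cnt == NJOBS) {
        while (b->cnt > 0) {
            int job = b->queue[--b->cnt];
            released[job] = 1;   /* stands in for milPostEvent(job, DEP, 0) */
        }
    }
}

/* Client side: stands in for milPostEvent(BARSERV, ARR, 0) followed by
 * milEnterDispatchingMode(wait_in_barrier, 0). */
static void barrier_arrive(barrier_info *b, int myid) {
    server_handle(b, myid);
}
```

The key property, visible in the model, is that no job is released until the last one arrives.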
17. Dynamic Page- and Job-Migration
- Migration may occur in case of:
  - Remote memory access
  - Load imbalance
  - The user coming back from lunch
- Improving locality by location rearrangement
- Sometimes migration should be disabled:
  - by the system: ping-pong, critical sections
  - by the programmer: control systems
18. Locality of memory reference is THE dominant efficiency factor - Migration Can Help Locality
(Figure compares: only job migration, only page migration, page + job migration.)
19. Load Sharing + Max. Locality = Minimum-Weight Multiway Cut
(Figure: threads and pages partitioned among hosts p, q, r by a multiway cut.)
20. Problems with the Multiway Cut Model
- NP-hard for cuts > 2; we have n > X,000,000. Polynomial 2-approximations are known.
- Not optimized for load balancing
- Page replicas
- The graph changes dynamically
- Only external accesses are recorded, so only partial information is available
21. Our Approach
- Record the history of remote accesses
- Use this information when taking decisions concerning load balancing/load sharing
- Save old information to avoid repeating bad decisions (learn from mistakes)
- Detect and solve ping-pong situations
- Do everything by piggybacking on communication that is taking place anyway
(Figure: per-page remote-access records, e.g. page 0, page 1, page 2.)
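A minimal sketch of the access-history idea. The record layout, the halving rule, and every name here are invented for illustration; the slide states only the principles (record remote accesses, age old information, use it for migration decisions).

```c
#include <assert.h>
#include <string.h>

#define HOSTS 8

/* Hypothetical per-page history: remote-access counts per host. */
typedef struct {
    unsigned remote_accesses[HOSTS];
} page_history;

/* Piggybacked on a message that is sent anyway: bump the counter
 * for the host that touched the page remotely. */
static void record_remote_access(page_history *h, int host) {
    h->remote_accesses[host]++;
}

/* Halve all counters once per epoch: old mistakes are remembered
 * for a while ("learn from mistakes") but do not dominate forever. */
static void age_history(page_history *h) {
    for (int i = 0; i < HOSTS; i++)
        h->remote_accesses[i] /= 2;
}

/* Migration hint: send the page to the host that used it most. */
static int best_target(const page_history *h) {
    int best = 0;
    for (int i = 1; i < HOSTS; i++)
        if (h->remote_accesses[i] > h->remote_accesses[best])
            best = i;
    return best;
}
```

Aging by halving is one common decay scheme; any monotone decay would serve the "avoid repeating bad decisions" goal.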
22. Ping-Pong
- Detection (local):
  1. Local threads attempt to use the page a short time after it leaves the local host
  2. The page leaves the host shortly after arrival
- Treatment (by the ping-pong server):
  - Collect information regarding all participating hosts and threads
  - Try to locate an underloaded target host
  - Stabilize the system by locking-in pages/threads
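The two local detection tests can be sketched as timestamp checks. This is a toy model: the record fields, the threshold, and the clock units are all assumptions, not the Millipede implementation.

```c
#include <assert.h>

/* Hypothetical per-page record kept by the local DSM layer. */
typedef struct {
    long arrived_at;   /* when the page last arrived on this host */
    long departed_at;  /* when it last left (0 = never left)      */
} page_times;

#define PP_THRESHOLD 10   /* "a short time", in the same clock units */

/* Test 1: a local access shortly after the page left the host. */
static int access_after_departure(const page_times *p, long now) {
    return p->departed_at != 0 && now - p->departed_at < PP_THRESHOLD;
}

/* Test 2: the page left shortly after it arrived. */
static int early_departure(const page_times *p, long leave_time) {
    return leave_time - p->arrived_at < PP_THRESHOLD;
}

/* A page is suspected of ping-pong when either local test fires;
 * the real system would then involve the ping-pong server. */
static int ping_pong_suspected(const page_times *p, long now) {
    return access_after_departure(p, now) ||
           (p->departed_at != 0 && early_departure(p, p->departed_at));
}
```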
23. Optimization - TSP: Effect of Locality (15 cities, Bare Millipede)
(Figure: execution time in seconds vs. number of hosts (1-6) for NO-FS, OPTIMIZED-FS, and FS.)
In the NO-FS case false sharing is avoided by aligning all allocations to page size. In the
other two cases each page is used by 2 threads: in FS no optimizations are used, and in
OPTIMIZED-FS the history mechanism is enabled.
24. TSP on 6 hosts (k = number of threads falsely sharing a page)

k | optimized? | DSM-related messages | ping-pong treatment msgs | thread migrations | execution time (sec)
--|------------|----------------------|--------------------------|-------------------|---------------------
2 | Yes        |   5100               | 290                      |  68               |  645
2 | No         | 176120               |   0                      |  23               | 1020
3 | Yes        |   4080               | 279                      |  87               |  620
3 | No         | 160460               |   0                      |  32               | 1514
4 | Yes        |   5060               | 343                      |  99               |  690
4 | No         | 155540               |   0                      |  44               | 1515
5 | Yes        |   6160               | 443                      | 139               |  700
5 | No         | 162505               |   0                      |  55               | 1442
25. Ping-Pong Detection Sensitivity (figure)
26. Applications
- Numerical computations: Multigrid
- Model checking: BDDs
- Compute-intensive graphics: Ray-Tracing, Radiosity
- Games, search trees, pruning, tracking, CFD, ...
27. Performance Evaluation (tuning parameters)
- L - underloaded, H - overloaded
- Delta (ms) - lock-in time
- t/o delta - polling (MGS, DSM)
- msg delta - system pages delta
- T_epoch - max history time
- ??? - remove old histories, or refresh old histories
- L_epoch - histories length
- page histories vs. job histories
- migration heuristic - which function?
- ping-pong - what is the initial noise? at what frequency is it ping-pong?
28. LU Decomposition, 1024x1024 matrix, written in SPLASH
- Performance improvements when there are few threads on each host
29. LU Decomposition, 2048x2048 matrix, written in SPLASH
- Super-linear speedups due to the caching effect
30. Jacobi Relaxation, 512x512 matrix (using 2 matrices, no false sharing), written in ParC
31. Overhead of ParC/Millipede on a single host, tested with a tracking algorithm
32. Info...
http://www.cs.technion.ac.il/Labs/Millipede
millipede@cs.technion.ac.il
A release is available at the Millipede site!